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Abstract 

o 

C^l Determinantal point processes (DPPs) are elegant probabilistic models of repulsion 

1-H that arise in quantum physics and random matrix theory. In contrast to traditional 

I ^ structured models like Markov random fields, which become intractable and hard to 

approximate in the presence of negative correlations, DPPs offer efficient and exact 
algorithms for sampling, marginalization, conditioning, and other inference tasks. We 
provide a gentle introduction to DPPs, focusing on the intuitions, algorithms, and 
r— I extensions that are most relevant to the machine learning community, and show how 

DPPs can be applied to real-world applications like finding diverse sets of high-quality 
^> search results, building informative summaries by selecting diverse sentences from 

^ t documents, modeling non-overlapping human poses in images or video, and automatically 

building timelines of important news stories. 

^ 1 Introduction 

> 

fT) Probabilistic modeling and learning techniques have become indispensable tools for analyzing 

^ data, discovering patterns, and making predictions in a variety of real-world settings. In 

recent years, the widespread availability of both data and processing capacity has led to 
new applications and methods involving more complex, structured output spaces, where the 
^ goal is to simultaneously make a large number of interrelated decisions. Unfortunately, the 

CN introduction of structure typically involves a combinatorial explosion of output possibilities, 

making inference computationally impractical without further assumptions. 

A popular compromise is to employ graphical models, which are tractable when the 
graph encoding local interactions between variables is a tree. For loopy graphs, inference 
can often be approximated in the special case when the interactions between variables are 
positive and neighboring nodes tend to have the same labels. However, dealing with global, 
negative interactions in graphical models remains intractable, and heuristic methods often 
fail in practice. 

Determinantal point processes (DPPs) offer a promising and complementary approach. 
Arising in quantum physics and random matrix theory, DPPs are elegant probabilistic models 
of global, negative correlations, and offer efficient algorithms for sampling, marginalization, 
conditioning, and other inference tasks. While they have been studied extensively by 
mathematicians, with a deep and beautiful theory, DPPs are relatively new in machine 
learning. We aim to provide a comprehensible introduction to DPPs, focusing on the 
intuitions, algorithms, and extensions that are most relevant to our community. 
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Figure 1: Diversity is used to generate a set of summary timelines describing the most 
important events from a large news corpus. 



1.1 Diversity 

A DPP is a distribution over subsets of a fixed ground set, for instance, sets of search results 
selected from a large database. Equivalently, a DPP over a ground set of N items can be 
seen as modeling a binary characteristic vector of length N. The essential characteristic of a 
DPP is that these binary variables are negatively correlated; that is, the inclusion of one item 
makes the inclusion of other items less likely. The strengths of these negative correlations 
are derived from a kernel matrix that defines a global measure of similarity between pairs 
of items, so that more similar items are less likely to co-occur. As a result, DPPs assign 
higher probability to sets of items that are diverse; for example, a DPP will prefer search 
results that cover multiple distinct aspects of a user's query, rather than focusing on the 
most popular or salient one. 

This general concept of diversity can take on a variety of forms depending on context 
and application. Including multiple kinds of search results might be seen as covering or 
summarizing relevant interpretations of the query or its associated topics; see Figure [TJ 
Alternatively, items inhabiting a continuous space may exhibit diversity as a result of 
repulsion^ as in Figure [2j In fact, certain repulsive quantum particles are known to be 
precisely described by a DPP; however, a DPP can also serve as a model for general repulsive 
phenomena, such as the locations of trees in a forest, which appear diverse due to physical 
and resource constraints. Finally, diversity can be used as a filtering prior when multiple 
selections must be based on a single detector or scoring metric. For instance, in Figure |3] 
a weak pose detector favors large clusters of poses that are nearly identical, but filtering 
through a DPP ensures that the final predictions are well-separated. 



1.2 Outline 

In this paper we present general mathematical background on DPPs along with a range of 
modeling extensions, efficient algorithms, and theoretical results that aim to enable practical 
modeling and learning. The material is organized as follows. 



Section [2| |Determinantal point processed . We begin with an introduction to deter- 



minantal point processes tailored to the interests of the machine learning community. We 
focus on discrete DPPs, emphasizing intuitions and including new, simplified proofs for some 



2 



Figure 2: On the left, points are sampled randomly; on the right, repulsion between points 
leads to the selection of a diverse set of locations. 




Figure 3: On the left, the output of a human pose detector is noisy and uncertain; on the 
right, applying diversity as a filter leads to a clean, separated set of predictions. 



theoretical results. We provide descriptions of known efficient inference algorithms, and 
characterize their computational properties. 



Section [3f |Representation and algorithms , We describe a decomposition of the DPP 



that makes explicit its fundamental tradeoff between quality and diversity. We compare the 
expressiveness of DPPs and MRFs, characterizing the tradeoffs in terms of modeling power 
and computational efficiency. We also introduce a dual representation for DPPs, showing 
how it can be used to perform efficient inference over large ground sets. When the data 
are high-dimensional and dual inference is still too slow, we show that random projections 
can be used to maintain a provably close approximation to the original model while greatly 
reducing computational requirements. 

Section [4[ |Learning[ We derive an efficient algorithm for learning the parameters of a 
quality model when the diversity model is held fixed. We employ this learning algorithm to 
perform extractive summarization of news text. 



Section [5f fc-DPPs] , We present an extension of DPPs that allows for explicit control 



over the number of items selected by the model. We show not only that this extension solves 
an important practical problem, but also that it increases expressiveness: a fc-DPP can 
capture distributions that a standard DPP cannot. The extension to fc-DPPs necessitates 
new algorithms for efficient inference based on recursions for the elementary symmetric 
polynomials. We validate the new model experimentally on an image search task. 
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Section |6f [Structured DPPs , We extend DPPs to model diverse sets of structured items, 
such as sequences or trees, where there are combinatorially many possible configurations. 
In this setting the number of possible subsets is doubly-exponential, presenting a daunting 
computational challenge. However, we show that a factorization of the quality and diversity 
models together with the dual representation for DPPs makes efficient inference possible 
using second-order message passing. We demonstrate structured DPPs on a toy geographical 
paths problem, a still-image multiple pose estimation task, and two high-dimensional text 
threading tasks. 



2 Determinantal point processes 



Determinantal point processes (DPPs) were first identified as a class by Macchi |1975j , who 
called them "fermion processes" because they give the distributions of fermion systems 
at thermal equilibrium. The Pauli exclusion principle states that no two fermions can 
occupy the same quantum state; as a consequence fermions exhibit what is known as the 
"anti-bunching" effect. This repulsiveness is described precisely by a DPP. 

In fact, years before Macchi gave them a general treatment, speciffc DPPs appeared 
in major results in random matrix theory [Mehta and Gaudin 



Ginibre, 1965 , where they continue to play an important role Diaconis 



1960 



Dyson 


1962a|b|c 




2003, 


Johansson 



2005b . Recently, DPPs have attracted a flurry of attention in the mathematics community 



Borodin and Olshanski, 2000, 


Borodin and Soshnikov, 


2003, 


Borodin and Rains, 2005 


Borodin et al. I |2010l [Burton anc 


[ Pemantlel|1993l|Johanssonl 20021 2004, 2005a, Okounkov 



2001 



Okounkov and Reshetikhin 



2003 



Shirai and Takahashi , 2000] , and much progress has 



been made in understanding their formal combinatorial and probabilistic properties. The 
term "determinantal" was first used by Borodin and Olshanski [2000J , and has since become 



accepted as standard. Many good mathematical surveys are now available Borodin, 2009 



Hough et al.j [20061 Piirai and Takahashi| |2003a|bl |Lyons| [20031 |Soshniko\^ |2000| |Tao| [2009 



We begin with an overview of the aspects of DPPs most relevant to the machine learning 
community, emphasizing intuitions, algorithms, and computational properties. 



2.1 Definition 

A point process P on a ground set 3^ is a probability measure over "point patterns" or "point 
configurations" of 3^, which are finite subsets of y. For instance, 3^ could be a continuous time 
interval during which a scientist records the output of a brain electrode, with V{{yi, y2, ys}) 
characterizing the likelihood of seeing neural spikes at times ?/2, and ys- Depending on 
the experiment, the spikes might tend to cluster together, or they might occur independently, 
or they might tend to spread out in time. V captures these correlations. 

For the remainder of this paper, we will focus on discrete, finite point processes, where 
we assume without loss of generality that 3^ = {1, 2, ... , N}; in this setting we sometimes 
refer to elements of 3^ as items. Much of our discussion extends to the continuous case, but 
the discrete setting is computationally simpler and often more appropriate for real-world 
data — e.g., in practice, the electrode voltage will only be sampled at discrete intervals. The 
distinction will become even more apparent when we apply our methods to y with no natural 
continuous interpretation, such as the set of documents in a corpus. 
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In the discrete point process is simply a probability measure on 2^, the set of ah 

subsets of 3^. A sample from V might be the empty set, the entirety of 3^, or anything in 
between. V is called a determinantal point process if, when 1^ is a random subset drawn 
according to 7^, we have, for every A cy^ 

V{A <ZY) = det{KA) (1) 

for some real, symmetric N x N matrix K indexed by the elements of Here, Ka = 
[Kij].j^^ denotes the restriction of K to the entries indexed by elements of A, and we adopt 
det(i^0) = 1. Note that normalization is unnecessary here, since we are defining marginal 
probabilities that need not sum to 1. 

Since P is a probability measure, all principal minors det(A^^) of K must be nonnegative, 
and thus K itself must be positive semidefinite. It is possible to show in the same way that 



the eigenvalues of K are bounded above by one using Equation (27), which we introduce 
later. These requirements turn out to be sufficient: any ^ K ^ I, defines a DPP. This 
will be a consequence of Theorem |2.3[ 

We refer to K as the marginal kernel since it contains all the information needed to 
compute the probability of any subset A being included in 1^. A few simple observations 
follow from Equation Q. If A = {i} is a singleton, then we have 

V{i eY) = Kii . (2) 

That is, the diagonal of K gives the marginal probabilities of inclusion for individual elements 
of y. Diagonal entries close to 1 correspond to elements of y that are almost always selected 
by the DPP. Furthermore, if A = {i^j} is a two-element set, then 



V{i,jeY)^ 



(3) 

KuKjj - KijKji (4) 

-2 



V{ieY)V{jeY)-Kf^. (5) 



Thus, the off-diagonal elements determine the negative correlations between pairs of elements: 
large values of Kij imply that i and j tend not to co-occur. 

Equation ^ demonstrates why DPPs are "diversifying" . If we think of the entries of 
the marginal kernel as measurements of similarity between pairs of elements in y, then 
highly similar elements are unlikely to appear together. If Kij = ^jKuKjj^ then i and j 
are "perfectly similar" and do not appear together almost surely. Conversely, when K is 
diagonal there are no correlations and the elements appear independently. Note that DPPs 
cannot represent distributions where elements are more likely to co-occur than if they were 
independent: correlations are always nonpositive. 

Figure [4] shows the difference between sampling a set of points in the plane using a 
DPP (with Kij inversely related to the distance between points i and j), which leads to 
a relatively uniformly spread set with good coverage, and sampling points independently, 
which results in random clumping. 



^In general, K need not be symmetric. However, in the interest of simplicity, we proceed with this 
assumption; it is not a significant limitation for our purposes. 
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DPP Independent 



Figure 4: A set of points in the plane drawn from a DPP (left), and the same number of 
points sampled independently using a Poisson point process (right). 



2.1.1 Examples 

In this paper, our focus is on using DPPs to model real-world data. However, many 
theoretical point processes turn out to be exactly determinantal, which is one of the main 
reasons they have received so much recent attention. In this section we briefly describe a few 
examples; some of them are quite remarkable on their own, and as a whole they offer some 
intuition about the types of distributions that are realizable by DPPs. Technical details for 
each example can be found in the accompanying reference. 



Descents in random sequences [Borodin et al. , 2010] Given a sequence of random 



numbers drawn uniformly and independently from a finite set (say, the digits 0-9), the 
locations in the sequence where the current number is less than the previous number form 
a subset of {2, 3, . . . , A^}. This subset is distributed as a determinantal point process. 
Intuitively, if the current number is less than the previous number, it is probably not too 
large, thus it becomes less likely that the next number will be smaller yet. In this sense, the 
positions of decreases repel one another. 



Non-intersecting random walks [Johansson , 2004] Consider a set of k independent. 



simple, symmetric random walks of length T on the integers. That is, each walk is a 
sequence xi,X2, . . . where Xi — x^+i is either -1 or +1 with equal probability. If we 
let the walks begin at positions x\^x\^ . . . and condition on the fact that they end at 
positions x^, . . . , Xj. and do not intersect, then the positions x]^, x^, . . . , at any time t 
are a subset of Z and distributed according to a DPP. Intuitively, if the random walks do 
not intersect, then at any time step they are likely to be far apart. 

Edges in random spanning trees [Burton and Pemantle[ [199 3 Let G be an arbitrary 
finite graph with N edges, and let T be a random spanning tree chosen uniformly from the 
set of all the spanning trees of G. The edges in T form a subset of the edges of G that is 
distributed as a DPP. The marginal kernel in this case is the transfer-impedance matrix, 
whose entry A^eie2 is the expected signed number of traversals of edge 62 when a random 
walk begins at one endpoint of ei and ends at the other (the graph edges are first oriented 
arbitrarily). Thus, edges that are in some sense "nearby" in G are similar according to 
and as a result less likely to participate in a single uniformly chosen spanning tree. As this 
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Figure 5: Aztec diamonds. 
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example demonstrates, some DPPs assign zero probability to sets whose cardinality is not 
equal to a particular k; in this case, k is the number of nodes in the graph minus one — the 
number of edges in any spanning tree. We will return to this issue in Section [5j 



Eigenvalues of random matrices [Ginibre , 1965, Mehta and Gaudin, 1960| Let M be 

a random matrix obtained by drawing every entry independently from the complex normal 
distribution. This is the complex Ginibre ensemble. The eigenvalues of M, which form 
a finite subset of the complex plane, are distributed according to a DPP. If a Hermitian 
matrix is generated in the corresponding way, drawing each diagonal entry from the normal 
distribution and each pair of off-diagonal entries from the complex normal distribution, then 
we obtain the Gaussian unitary ensemble, and the eigenvalues are now a DPP-distributed 
subset of the real line. 



Aztec diamond tilings [Johansson , 2Q05a| The Aztec diamond is a diamond-shaped 



union of lattice squares, as depicted in Figure 5a, (Half of the squares have been colored 
gray in a checkerboard pattern.) A domino tiling is a perfect cover of the Aztec diamond 
using 2x1 rectangles, as in Figure |5bj Suppose that we draw a tiling uniformly at random 
from among all possible tilings. (The number of tilings is known to be exponential in the 
width of the diamond.) We can identify this tiling with the subset of the squares that are 
(a) painted gray in the checkerboard pattern and (b) covered by the left half of a horizontal 
tile or the bottom half of a vertical tile (see Figure [5c]). This subset is distributed as a DPP. 

2.2 L-ensembles 

For the purposes of modeling real data, it is useful to slightly restrict the class of DPPs 
by focusing on L-ensembles. First introduced by Borodin and Rains [2005| , an L-ensemble 
defines a DPP not through the marginal kernel but through a real, symmetric matrix L 
indexed by the elements of y-. 



p^(Y = y) ocdet(Ly) 



(6) 



Whereas Equation ([T]) gave the marginal probabilities of inclusion for subsets A, Equation ^ 
directly specifies the atomic probabilities for every possible instantiation of Y. As for it 
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is easy to see that L must be positive semidefinite. However, since Equation ^ is only a 
statement of proportionality, the eigenvalues of L need not be less than one; any positive 
semidefinite L defines an L-ensemble. The required normalization constant can be given in 
closed form due to the fact that X^ycj; det(Ly) = det(L + /), where / is the N x N identity 
matrix. This is a special case of the following theorem. 

Theorem 2.1. For any A cy, 

J2 det(Ly) = det(L + /^), (7) 

ACYCy 

where is the diagonal matrix with ones in the diagonal positions corresponding to elements 
of A — y — A, and zeros everywhere else. 

Proof. Suppose that A^y^ then Equation ^ holds trivially. Now suppose inductively that 
the theorem holds whenever A has cardinality less than k. Given A such that |A| = A: > 0, 
let i be an element of y where i ^ A. Splitting blockwise according to the partition 
3^ = {z} U 3^ — {z}, we can write 

L + I^^( , f^J \ , (8) 

where Lj^ is the subcolumn of the ith column of L whose rows correspond to i, and similarly 
for L^j. By multilinearity of the determinant, then, 



det(L + 1^) 



^ii Ly-{i} + h-{i}-A 



+ 



1 

-^ii Ly-{i} + h-{i}-A 



(9) 



= det(L + + det(Ly_{i} + • (10) 

We can now apply the inductive hypothesis separately to each term, giving 

det(L + /^) = det(Ly) + ^ det(Ly) (11) 

AU{i}CYCy ACYCy-{i} 

= J2 det(Ly), (12) 

ACYCy 

where we observe that every Y either contains i and is included only in the first sum, or 
does not contain i and is included only in the second sum. □ 

Thus we have 

As a shorthand, we will write Vl{Y) instead of Vl{Y = Y) when the meaning is clear. 

We can write a version of Equation ^ for L-ensembles, showing that if L is a measure 
of similarity then diversity is preferred: 

VL{{iJ})^VL{{i})VL{{j}) - ( det(L + /) )' • (14) 

In this case we are reasoning about the full contents of Y rather than its marginals, but the 
intuition is essentially the same. Furthermore, we have the following result of iMacchi, ^1975] . 
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Theorem 2.2. An L-ensemble is a DPP, and its marginal kernel is 

K^L{L + I)-^^I-{L + I)-\ 



(15) 



Proof. Using Theorem 2.1, the marginal probabihty of a set A is 

J2ACYcy det(Ly) 



VUA CY)^ 



T^Ycy det(Ly) 
det(L + /^) 



det(L + 1) 
= det ((L + I^)(L + I)-i) . 

We can use the fact that L{L + = I — {L + to simphfy and obtain 

Vl{A CY) = det (I^(L + /)-! + / - (L + /)-!) 
= det (I - Ia(L + /)-^) 
= det (/^ + /^i^) , 



(16) 

(17) 
(18) 



(19) 
(20) 
(21) 



where we let K = I — {L + . Now, we observe that left multiplication by I a zeros out 
all the rows of a matrix except those corresponding to A. Therefore we can split blockwise 
using the partition y = AuAto get 



det (I^ + IaK) 



'\A\x\A\ 



^AA 
= det {Ka) 





Ka 



(22) 
(23) 

□ 



Note that K can be computed from an eigendecomposition of L = 'Yl,n=i ^nVnvJi by a 
simple rescaling of eigenvalues: 



N 



An 



n=l 



An + 1 



(24) 



Conversely, we can ask when a DPP with marginal kernel K is also an L-ensemble. By 
inverting Equation (15), we have 

L = K{I - K)-\ (25) 
and again the computation can be performed by eigendecomposition. However, while the 



inverse in Equation (15) always exists due to the positive coefficient on the identity matrix, 



the inverse in Equation (25) may not. In particular, when any eigenvalue of K achieves the 



upper bound of 1, the DPP is not an L-ensemble. We will see later that the existence of the 



inverse in Equation (25) is equivalent to V giving nonzero probability to the empty set. (This 
is somewhat analogous to the positive probability assumption in the Hammersley- Clifford 
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theorem for Markov random fields.) This is not a major restriction, for two reasons. First, 
when modeling real data we must typically allocate some nonzero probability for rare or 
noisy events, so when cardinality is one of the aspects we wish to model, the condition is not 
unreasonable. Second, we will show in Section [S] how to control the cardinality of samples 
drawn from the DPP, thus sidestepping the representational limits of L-ensembles. 

Modulo the restriction described above, K and L offer alternative representations of 
DPPs. Under both representations, subsets that have higher diversity, as measured by the 
corresponding kernel, have higher likelihood. However, while K gives marginal probabilities, 
L-ensembles directly model the atomic probabilities of observing each subset of 3^, which 
offers an appealing target for optimization. Furthermore, L need only be positive semidefinite, 
while the eigenvalues of K are bounded above. For these reasons we will focus our modeling 
efforts on DPPs represented as L-ensembles. 



2.2.1 Geometry 

Determinants have an intuitive geometric interpretation. Let S be a I? x matrix such 
that L = B^B. (Such a B can always be found for D < N when L is positive semidefinite.) 
Denote the columns of S by for z = 1, 2, . . . , A^. Then: 

Vl{Y) (X det(Ly) = Yo\^{{Bi}i^Y) , (26) 

where the right hand side is the squared |y |-dimensional volume of the parallelepiped 
spanned by the columns of B corresponding to elements in Y. 

Intuitively, we can think of the columns of B as feature vectors describing the elements 
of 3^. Then the kernel L measures similarity using dot products between feature vectors, 



and Equation (26) says that the probability assigned by a DPP to a set Y is related to the 
volume spanned by its associated feature vectors. This is illustrated in Figure |6j 

From this intuition we can verify several important DPP properties. Diverse sets are 
more probable because their feature vectors are more orthogonal, and hence span larger 
volumes. Items with parallel feature vectors are selected together with probability zero, 
since their feature vectors define a degenerate parallelepiped. All else being equal, items 
with large-magnitude feature vectors are more likely to appear, because they multiply the 
spanned volumes for sets containing them. 



We will revisit these intuitions in Section 3.1, where we decompose the kernel L so as to 



separately model the direction and magnitude of the vectors Bi. 



2.3 Properties 

In this section we review several useful properties of DPPs. 



Restriction If Y is distributed as a DPP with marginal kernel then Y n where 
A C is also distributed as a DPP, with marginal kernel Ka- 



Complement If Y is distributed as a DPP with marginal kernel then 3^ — 1^ is also 
distributed as a DPP, with marginal kernel K — I — K . In particular, we have 

V{A n Y = 0) = det{KA) = det(/ - Ka) , (27) 
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(b) (c) 

Figure 6: A geometric view of DPPs: each vector corresponds to an element of 3^. (a) The 
probabihty of a subset Y is the square of the volume spanned by its associated feature 
vectors, (b) As the magnitude of an item's feature vector increases, so do the probabilities of 
sets containing that item, (c) As the similarity between two items increases, the probabilities 
of sets containing both of them decrease. 

where / indicates the identity matrix of appropriate size. It may seem counterintuitive that 
the complement of a diversifying process should also encourage diversity. However, it is easy 
to see that 

V{i,j ^Y)^l-V{ieY)- V{j eY) + Vii,j e Y) (28) 
<l~V{i eY) - V{j eY) + V{i G Y)V{j G Y) (29) 
= V{i ^Y) + V{j tY)-l + {l-r{it Y)){1 - V{j t Y)) (30) 
= V{i^Y)V{j^Y), (31) 

where the inequality follows from Equation 

Domination li K ^ K\ that is, — K is positive semidefinite, then for all A C y we 

have 

det{KA) < det{K'j^) . (32) 

In other words, the DPP defined by is larger than the one defined by K in the sense that 
it assigns higher marginal probabilities to every set A. An analogous result fails to hold for 
L due to the normalization constant. 

Scaling If K = ^K^ for some < 7 < 1, then for all A C y we have 

det(KA) = t'^I det(K;i) . (33) 

It is easy to see that K defines the distribution obtained by taking a random set distributed 
according to the DPP with marginal kernel K\ and then independently deleting each of its 
elements with probability 1 — 7. 



11 




10 15 20 

X (L eigenvalue) 



25 



30 



Figure 7: The mapping between eigenvalues of L and K. 



Cardinality Let Ai, A2, . . . , Aat be the eigenvalues of L. Then is distributed as the 
number of successes in Bernoulli trials, where trial n succeeds with probability 
This fact follows from Theorem 2^, which we prove in the next section. One immediate 
consequence is that cannot be larger than rank(L). More generally, the expected 
cardinality of Y is 

E[|l^l] = EY^=tr(i^). (34) 

n=l 



+ 1 



and the variance is 



Var(lrl) = 



N 

E 

n 



+ 1 



(35) 



Note that, by Equation (15|), 
plot of the function /(A) 



Ai+l' 
A 



are the eigenvalues of K. Figure [7| shows a 



A2+I' • • • ' Xn- 

. It is easy to see from this why the class of L-ensembles does 
not include DPPs where the empty set has probability zero — at least one of the Bernoulli 
trials would need to always succeed, and in turn one or more of the eigenvalues of L would 
be infinite. 

In some instances, the sum of BernouUis may be an appropriate model for uncertain 
cardinality in real- world data, for instance when identifying objects in images where the 
number of objects is unknown in advance. In other situations, it may be more practical to 
fix the cardinality of Y up front, for instance when a set of exactly ten search results is 
desired, or to replace the sum of Bernoullis with an alternative cardinality model. We show 
how these goals can be can be achieved in Section [5| 



2.4 Inference 

One of the primary advantages of DPPs is that, although the number of possible realizations 
of Y is exponential in many types of inference can be performed in polynomial time. In 
this section we review the inference questions that can (and cannot) be answered efficiently. 
We also discuss the empirical practicality of the associated computations and algorithms, 
estimating the largest values of N that can be handled at interactive speeds (within 2-3 
seconds) as well as under more generous time and memory constraints. The reference 
machine used for estimating real-world performance has eight Intel Xeon E5450 3Ghz cores 
and 32GB of memory. 
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2.4.1 Normalization 

As we have already seen, the partition function, despite being a sum over 2^ terms, can be 
written in closed form as det(L + /). Determinants of N x N matrices can be computed 
through matrix decomposition in 0{N^) time, or reduced to matrix multiplication for better 
asymptotic performance. The Coppersmith- Winograd algorithm, for example, can be used 
to compute determinants in about 0(A^^'^^^) time. Going forward, we will use cj to denote 
the exponent of whatever matrix multiplication algorithm is used. 

Practically speaking, modern computers can calculate determinants up to ^ 5,000 at 
interactive speeds, or up to ^ 40,000 in about five minutes. When A^ grows much larger, 
the memory required simply to store the matrix becomes limiting. (Sparse storage of larger 
matrices is possible, but computing determinants remains prohibitively expensive unless the 
level of sparsity is extreme.) 



2.4.2 Marginalization 

The marginal probability of any set of items A can be computed using the marginal kernel 
as in Equation Q. From Equation ( [27| we can also compute the marginal probability that 
none of the elements in A appear. (We will see below how marginal probabilities of arbitrary 
configurations can be computed using conditional DPPs.) 

If the DPP is specified as an L-ensemble, then the computational bottleneck for marginal- 
ization is the computation of K. The dominant operation is the matrix inversion, which 
requires at least 0{N^) time by reduction to multiplication, or 0{N^) using Gauss- Jordan 
elimination or various matrix decompositions, such as the eigendecomposition method pro- 



posed in Section |2.2[ Since an eigendecomposition of the kernel will be central to sampling, 
the latter approach is often the most practical when working with DPPs. 

Matrices up to A^ ^ 2,000 can be inverted at interactive speeds, and problems up to 
A^ ^ 20,000 can be completed in about ten minutes. 



2.4.3 Conditioning 

The distribution obtained by conditioning a DPP on the event that none of the elements in 
a set A appear is easy to compute. For B cy not intersecting with A we have 

P.(V = B|^nV = «) = ^|i^^ (36) 



J2B':B'r\A=9 det(LB: 

det(LB) 
det(L^ + /) ' 



(38) 



where is the restriction of L to the rows and columns indexed by elements in y — A. In 
other words, the conditional distribution (over subsets of y — A) is itself a DPP, and its 
kernel is obtained by simply dropping the rows and columns of L that correspond to 
elements in A. 
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We can also condition a DPP on the event that all of the elements in a set A are observed. 
For B not intersecting with A we have 

PUY = AuBlACY) = Ii^^ZA^ (39) 

det(L^uB) ^^Q^ 



det(LAuB) 



det(L + I^) ' ^^^^ 

where is the matrix with ones in the diagonal entries indexed by elements of y — A and 
zeros everywhere else. Though it is not immediately obvious, [Borodin and Rains f2005] 



showed that this conditional distribution (over subsets of y — A) is again a DPP, with a 
kernel given by 

L^={[{L + Ia)-^]^)-'-I. (42) 

(Following the N x N inversion, the matrix is restricted to rows and columns indexed by 
elements in y — A^ then inverted again.) It is easy to show that the inverses exist if and 
only if the probability of A appearing is nonzero. 



Combining Equation (38) and Equation (41), we can write the conditional DPP given 



an arbitrary combination of appearing and non-appearing elements: 

Vl{Y ^A'^'UB I C Y, A^^* n Y = 0) = det(L^,. ) ^^^^ 

det(l7^out + -t^inj 

The corresponding kernel is 

Lfut = ([(L^out +I^in)-i]^,)"' -/. (44) 
Thus, the class of DPPs is closed under most natural conditioning operations. 

General marginals These formulas also allow us to compute arbitrary marginals. For 



example, applying Equation (15) to Equation (42) yields the marginal kernel for the 
conditional DPP given the appearance of A: 

K^ = I-[{L + I^)-']^. (45) 

Thus we have 

V{B CY\ACY) = det(K^) . (46) 
(Note that K"^ is indexed by elements of y — A^ so this is only defined when A and B are 



disjoint.) Using Equation (27) for the complement of a DPP, we can now compute the 
marginal probability of any partial assignment, i.e., 

V(A C S n Y = 0) = V(A C Y)V{B n Y = 0|A C Y) (47) 

= det{KA) det(/ - K^) . (48) 

Computing conditional DPP kernels in general is asymptotically as expensive as the 
dominant matrix inversion, although in some cases (conditioning only on non-appearance), 
the inversion is not necessary. In any case, conditioning is at most a small constant factor 
more expensive than marginalization. 
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Algorithm 1 Sampling from a DPP 



Input: eigendecomposition {{vn, )^n)}n=i ^ 

forn = 1,2, . . . ,7V do 

J ^ J U {n} with prob. 
end for 

V ^ {VnjneJ 

while \V\ > do 

Select i from y with Pr(z) = J2vevi^^ ^i)'^ 

Y ^YUi 

V ^ an orthonormal basis for the subspace of V orthogonal to 
end while 

Output: Y 



2.4.4 Sampling 



Algorithm [T| due to Hough et al. [2006| , gives an efficient algorithm for sampling a config- 



uration Y from a DPP. The input to the algorithm is an eigendecomposition of the DPP 
kernel L. Note that is the zth standard basis A^-vector, which is all zeros except for a one 
in the ith position. We will prove the following theorem. 

Theorem 2.3. Let L = J2n=i ^n'^n'^n be an orthonormal eigendecomposition of a positive 
semidefinite matrix L. Then Algorithm^ samples Y ^ Vl- 

Algorithm [l] has two main loops, corresponding to two phases of sampling. In the first 
phase, a subset of the eigenvectors is selected at random, where the probability of selecting 
each eigenvector depends on its associated eigenvalue. In the second phase, a sample Y is 
produced based on the selected vectors. Note that on each iteration of the second loop, the 
cardinality of Y increases by one and the dimension of V is reduced by one. Since the initial 



dimension of V is simply the number of selected eigenvectors (| J|), Theorem 2.3 has the 
previously stated corollary that the cardinality of a random sample is distributed as a sum 
of Bernoulli variables. 



To prove Theorem 2.3 we will first show that a DPP can be expressed as a mixture of 
simpler, elementary DPPs. We will then show that the first phase chooses an elementary 
DPP according to its mixing coefficient, while the second phase samples from the elementary 
DPP chosen in phase one. 

Definition 2.4. A DPP is called elementary if every eigenvalue of its marginal kernel is 
in {0, 1}. We write , where V is a set of orthonormal vectors, to denote an elementary 
DPP with marginal kernel — "^^^yvv^ . 



We introduce the term "elementary" here; Hough et al. [2006 refer to elementary DPPs 



as determinantal projection processes, since is an orthonormal projection matrix to 



the subspace spanned by V. Note that, due to Equation (25), elementary DPPs are not 
generally L-ensembles. We start with a technical lemma. 
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Lemma 2.5. Let Wn for n — 1^2^ ... he an arbitrary sequence ofkxk rank-one matrices^ 
and let Wni denote the ith column ofWn- Let Wj — J2neJ ^^^^ 

dei{Wj)= <^^^{[Wn,lWn,2...Wn^k]). (49) 

n-^ ,722 ' • • • ^'^k ^ 
distinct 

Proof. By multilinearity of the determinant, 

det(T4^j) = dei{[WniWj2 . . . Wjk]) , (50) 

and, by induction, 

det{Wj)= Y d(i^i[WmiWn,2...Wn,k]). (51) 

Since Wn has rank one, the determinant of any matrix containing two or more columns of 
Wn is zero; thus the terms in the sum vanish unless ni, 712, . . . ^n^ are distinct. □ 

Lemma 2.6. A DPP with kernel L = '^n=i ^n^n^n ^ mixture of elementary DPPs: 

^- = adTT) i: ^"^n^-. (52) 

^ ^ JC{l,2,...,Ar} neJ 

where Vj denotes the set {vn}neJ' 

Proof. Consider an arbitrary set A, with A: = l^l. Let Wn = [t'n'^nlA for n = 1, 2, . . . , A^; 
note that all of the Wn have rank one. From the definition of K^-^ , the mixture distribution 



on the right hand side of Equation (52) gives the following expression for the marginal 
probability of A: 

^ ^ JC{l,2,...,Ar} \neJ J neJ 

Applying Lemma [2^ this is equal to 

deta+n E E dei{[Wn,,...W^,,])X{Xn (54) 

^ ^ JC{l,2,...,Ar} ni,...,n;,ej, neJ 

distinct 



distinct 

1 ^ A A ^ 

distinct 



(55) 
(56) 
(57) 
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using the fact that det(L + /) = 11^=1 (^ri + !)• Applying Lemma 
the definition of K in terms of the eigendecomposition of L, we 
probabihty of A given by the mixture is 



dei{KA) 



Since the two distributions agree on all marginals, they are equal. 
Next, we show that elementary DPPs have fixed cardinality. 



2.5 



in reverse and then 



lave that the marginal 




(58) 



□ 



Lemma 2.7. If Y is drawn according to an elementary DPP , then \Y\ = \V\ with 
probability one. 

Proof. Since has rank \V\, r^{Y C y) = whenever \Y\ > \V\, so |y| < \V\. But we 
also have 



E[\Y\] = E 

N 



N 



J^HneY) 
J^ElKneY)] 

n=l 
N 

Y^Knn-tT{K) = \V\. 



n=l 



Thus = \V\ almost surely. 
We can now prove the theorem. 



(59) 
(60) 

(61) 

□ 



Proof of Theorem \2.3\ Lemma 2^ says that the mixture weight of V^-^ is given by the 
product of the eigenvalues corresponding to the eigenvectors ^Vj^ normalized by 
det(L + /) = n^=i(^n + !)• This shows that the first loop of Algorithm [l] selects an 
elementary DPP with probability equal to its mixture component. All that remains is 
to show that the second loop samples Y ^ . 

Let B represent the matrix whose rows are the eigenvectors in so that — B 
Using the geometric interpretation of determinants introduced in Section 



2.2.1 



det(K^; 



IS 



equal to the squared volume of the parallelepiped spanned by {Bi}i^Y' Note that since V is 
an orthonormal set, Bi is just the projection of ei onto the subspace spanned by V . 

Let k — \V\. By Lemma 2.7 and symmetry, we can consider without loss of generality 



a single Y — {1,2,..., k}. Using the fact that any vector both in the span of V and 
perpendicular to is also perpendicular to the projection of onto the span of V ^ by the 
base X height formula for the volume of a parallelepiped we have 



Vol({SO,^y) = ||Si||Vol ({Proj^.^Sj 



k 

1 = 2 j 5 



(62) 



where Proj^^^ is the projection operator onto the subspace orthogonal to ei. Proceeding 
inductively. 



Vol({Si}iey) = ||Si||||Proj^^^S2||---||Proj^^^^ 



,efc-i 



Bk 



(63) 
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Assume that, as iteration j of the second loop in Algorithm [T] begins, we have already 
selected = 1, ^2 = 2, . . . , yj-i — j — 1. Then V in the algorithm has been updated to an 
orthonormal basis for the subspace of the original V perpendicular to ei, . . . , ej-i, and the 
probability of choosing yj — j is exactly 

W\ - ^4TTl|P-j^e....,e,_.i?.f • (64) 

Therefore, the probability of selecting the sequence 1, 2, . . . , is 

^||i?if||Proj^,^i?2f •••||Proj^,^,...,,^_^i?fcf = ivol2({i?,}i6y) • (65) 

Since volume is symmetric, the argument holds identically for all of the A:! orderings of Y. 
Thus the total probability that Algorithm [l] selects Y is det(Ky). □ 

Corollary 2.8. Algorithm^ generates Y in uniformly random order. 

Discussion To get a feel for the sampling algorithm, it is useful to visualize the distributions 
used to select i at each iteration, and to see how they are influenced by previously chosen 
items. Figure [8a| shows this progression for a simple DPP where 3^ is a flnely sampled grid 
of points in [0, 1], and the kernel is such that points are more similar the closer together 
they are. Initially, the eigenvectors V give rise to a fairly uniform distribution over points 
in 3^, but as each successive point is selected and V is updated, the distribution shifts to 



avoid points near those already chosen. Figure [8b] shows a similar progression for a DPP 
over points in the unit square. 

The sampling algorithm also offers an interesting analogy to clustering. If we think of 
the eigenvectors of L as representing soft clusters, and the eigenvalues as representing their 
strengths — the way we do for the eigenvectors and eigenvalues of the Laplacian matrix in 
spectral clustering — then a DPP can be seen as performing a clustering of the elements in 3^, 
selecting a random subset of clusters based on their strength, and then choosing one element 
per selected cluster. Of course, the elements are not chosen independently and cannot be 
identified with specific clusters; instead, the second loop of Algorithm [T] coordinates the 
choices in a particular way, accounting for overlap between the eigenvectors. 

Algorithm [l] runs in time 0{Nk^)^ where k = \V\ is the number of eigenvectors selected in 
phase one. The most expensive operation is the 0{NkP') Gram-Schmidt orthonormalization 
required to compute V^. li k is large, this can be reasonably expensive, but for most 
applications we do not want high-cardinality DPPs. (And if we want very high-cardinality 
DPPs, we can potentially save time by using Equation ( [27| to sample the complement instead.) 
In practice, the initial eigendecomposition of L is often the computational bottleneck, 
requiring 0{N^) time. Modern multi-core machines can compute eigendecompositions up 
to N ^ 1,000 at interactive speeds of a few seconds, or larger problems up to ^ 10,000 
in around ten minutes. In some instances, it may be cheaper to compute only the top k 
eigenvectors; since phase one tends to choose eigenvectors with large eigenvalues anyway, 
this can be a reasonable approximation when the kernel is expected to be low rank. Note 
that when multiple samples are desired, the eigendecomposition needs to be performed only 
once. 
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step Step 1 Step 2 

(a) Sampling points on an interval 

Step 1 Step 2 Step 3 Step 4 



□ 




□ 










m 



steps Step 6 Step 7 Step 8 

(b) Sampling points in the plane 



Figure 8: Sampling DPP over one-dimensional (top) and two-dimensional (bottom) particle 
positions. Red circles indicate already selected positions. On the bottom, lighter color 
corresponds to higher probability. The DPP naturally reduces the probabilities for positions 
that are similar to those already selected. 
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Deshpande and Rademacher |201Q| recently proposed a (1 — e)-approximate algorithm for 
sampling that runs in time 0{N^ log + A^log^ N^^^ log(| log A^)) when L is already 
decomposed as a Gram matrix, L = B. When B is known but an eigendecomposition is 
not (and N is sufficiently large), this may be significantly faster than the exact algorithm. 

2.4.5 Finding the mode 

Finding the mode of a DPP — that is, finding the set Y C y that maximizes Vl{Y) — is 
NP-hard. In conditional models, this problem is sometimes referred to as maximum a 
posteriori (or MAP) inference, and it is also NP-hard for most general structured models 
such as Markov random fields. Hardness was first shown for DPPs by Ko et al. [1995| , 



who studied the closely-related maximum entropy sampling problem: the entropy of a set 
of jointly Gaussian random variables is given (up to constants) by the log-determinant 
of their covariance matrix; thus finding the maximum entropy subset of those variables 
requires finding the principal covariance submatrix with maximum determinant. Here, we 
adapt the argument of Qivril and Magdon-Ismail |2Q09J , who studied the problem of finding 



maximum- volume submatrices. 

Theorem 2.9. Let DPP-MODE be the optimization problem of finding, for a positive semidef- 
inite N x N input matrix L indexed by elements ofy, the maximum value of det (Ly) over 
all Y ^ y. DPP-MODE is NP-hard, and furthermore it is NP-hard even to approximate 
DPP-MODE to a factor o/ | + e. 

Proof. We reduce from EXACT 3-COVER (X3C). An instance of X3C is a set S and a collection 
C of three-element subsets of S; the problem is to decide whether there is a sub-collection 

C C such that every element of S appears exactly once in (that is, is an exact 
3-cover). X3C is known to be NP-complete. 

The reduction is as follows. Let 3^ = {1, 2, ... , |C|}, and let S be a l^l x \C\ matrix where 
Bsi = if Ci contains s ^ S and zero otherwise. Define L — jB^ B^ where 1 < 7 < |. 
Note that the diagonal of L is constant and equal to 7, and an off-diagonal entry Lij is zero 
if and only if Ci and Cj do not intersect. L is positive semidefinite by construction, and 
the reduction requires only polynomial time. Let /c = We will show that the maximum 
value of det(Ly) is greater than 7^"^ if and only if C contains an exact 3-cover of S. 

(^) If C C C is an exact 3-cover of 5, then it must contain exactly k 3-sets. Letting Y 
be the set of indices in we have Ly = 7/, and thus its determinant is 7^ > 7^"^. 

(^) Suppose there is no 3-cover of S in C. Let Y be an arbitrary subset of y. If |y | < fc, 
then 

det(Ly)<n^n = 7l''l<7'"'. (66) 

ieY 

Now suppose \Y\ > and assume without loss of generality that y = {l,2,...,|y|}. We 
have Ly = jByBy, and 

det(Ly) = 7l^lVol2 ^{Bi}i^Y) ■ (67) 
By the base x height formula, 

Vol({Bi}i6y) = \\Bi\\\\Pioi^B,B2\\ ■ ■ ■ ||Proj^Bi,...,S|^|_iS|y|ll • (68) 
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Note that, since the columns of B are normahzed, each term in the product is at most 
one. Furthermore, at least |y | — + 1 of the terms must be strictly less than one, because 
otherwise there would be k orthogonal columns, which would correspond to a 3-cover. By 
the construction of 5, if two columns Bi and Bj are not orthogonal then Ci and Cj overlap 
in at least one of three elements, so we have 



llProj 



±B. 



BiW — \\Bi 



< WBi 



[Bl Bj] 
'.BAl 



Bi 



(69) 
(70) 



< \ - 



Therefore, 



det(Ly) < 7!^! 



\Y\-k+l 



k-l 



(71) 

(72) 
(73) 



since 7 < |. 

We have shown that the existence of a 3-cover implies the optimal value of DPP-MODE is 
at least 7^, while the optimal value cannot be more than 7^"^ if there is no 3-cover. Thus 
any algorithm that can approximate DPP-MODE to better than a factor of ^ can be used to 

solve X3C in polynomial time. We can choose 7 = | to show that an approximation ratio of 
I + e is NP-hard. □ 



Since there are only \C\ possible cardinalities for Theorem 2.9 shows that DPP-MODE 
is NP-hard even under cardinality constraints. 

Ko et al. , 1995j propose an exact, exponential branch-and-bound algorithm for finding 



the mode using greedy heuristics to build candidate sets; they tested their algorithm on 
problems up to = 75, successfully finding optimal solutions in up to about an hour. 
Modern computers are likely a few orders of magnitude faster; however, this algorithm is 
still probably impractical for applications with large A^. Qivril and Magdon-Ismail |2009| 
propose an efficient greedy algorithm for finding a set of size fc, and prove that it achieves an 
approximation ratio of O(^). While this guarantee is relatively poor for all but very small 
fc, in practice the results may be useful nonetheless. 



Submodularity Vl is log-submodular; that is, 

logVUY U {i}) - logVUY) > logVLiY' U {i}) - logrUY') 



(74) 



whenever Y CY^ C y — {i}. Intuitively, adding elements to Y yields diminishing returns as 
Y gets larger. (This is easy to show by a volume argument.) Submodular functions can be 
minimized in polynomial time |Schrijver, 2000], and many results exist for approximately 
maximizing monotone submodular functions, which have the special property that supersets 
always have a higher function value than their subsets [Nemhauser et al. , 1978, Fisher et al. 



1978, Feige, 1998 . In Section 4.2.1 we will discuss how these kinds of greedy algorithms can 



be adapted for DPPs. However, in general Vl is highly non-monotone, since the addition of 
even a single element can decrease the probability to zero. 
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2.5 Open questions 



While many properties of DPPs are well-understood, some open questions remain. We 
briefly mention two of the most relevant ones. 



2.5.1 Concavity of entropy 

The Shannon entropy of the DPP with marginal kernel K is given by 

H{K) ^ -Y,ny)^ogv{Y) . 



(75) 



YQy 



Conjecture 2.10 ( |Lyons| pOOS] ). H{K) is concave in K . 



While numerical simulation strongly suggests that the conjecture is true, to our knowledge 
no proof currently exists. 



2.5.2 Sum of squares 

In order to calculate, for example, the Hellinger distance between a pair of DPPs, it would 
be useful to be able to compute quantities of the form 

det(Ly)2 . (76) 

To our knowledge it is not currently known whether it is possible to compute such quantities 
efficiently. 



2.6 Related processes 

Historically, a wide variety of point process models have been proposed and applied to 
applications involving diverse subsets, particularly in settings where the items can be 
seen as points in a physical space and diversity is taken to mean some sort of "spreading" 
behavior. However, DPPs are essentially unique among this class in having efficient and 
exact algorithms for probabilistic inference, which is why they are particularly appealing 
models for machine learning applications. In this section we briefly survey the wider world 
of point processes and discuss the computational properties of alternative models; we will 
focus on point processes that lead to what is variously described as diversity, repulsiveness, 
(over)dispersion, regularity, order, and inhibition. 



2.6.1 Poisson point processes 



Perhaps the most fundamental point process is the Poisson point process, which is depicted 
on the right side of Figure [4] | Daley and Vere- Jones, 2003 . While defined for continuous 
in the discrete setting the Poisson point process can be simulated by flipping a coin 
independently for each item, and including those items for which the coin comes up heads. 
Formally, 

V{Y = Y) = \{p,\{{l-pi), (77) 
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where pi G [0, 1] is the bias of the ith. coin. The process is caUed stationary when pi does 
not depend on z; in a spatial setting this means that no region has higher density than any 
other region. 

A random set Y distributed as a Poisson point process has the property that whenever 
A and B are disjoint subsets of 3^, the random variables Y D A and Y D B are independent; 
that is, the points in Y are not correlated. It is easy to see that DPPs generalize Poisson 
point processes by choosing the marginal kernel K with Ku = pi and Kij = 0, z ^ j. This 
implies that inference for Poisson point processes is at least as efficient as for DPPs; in fact, 
it is more efficient, since for instance it is easy to compute the most likely configuration. 
However, since Poisson point processes do not model correlations between variables, they 
are rather uninteresting for most real- world applications. 

Addressing this weakness, various procedural modifications of the Poisson process have 
been proposed in order to introduce correlations between items. While such constructions 
can be simple and intuitive, leading to straightforward sampling algorithms, they tend to 
make general statistical inference difficult. 



Mater n repulsive processes Matern [1960 , 1986| proposed a set of techniques for 
thinning Poisson point processes in order to induce a type of repulsion when the items 
are embedded in a Euclidean space. The Type I process is obtained from a Poisson set Y 
by removing all items in Y that lie within some radius of another item in Y . That is, if 
two items are close to each other, they are both removed; as a result all items in the final 
process are spaced at least a fixed distance apart. The Type II Matern repulsive process, 
designed to achieve the same minimum distance property while keeping more items, begins 
by independently assigning each item in 1^ a uniformly random "time" in [0, 1]. Then, any 
item within a given radius of a point having a smaller time value is removed. Under this 
construction, when two items are close to each other only the later one is removed. Still, 
an item may be removed due to its proximity with an earlier item that was itself removed. 
This leads to the Type III process, which proceeds dynamically, eliminating items in time 
order whenever an earlier point which has not been removed lies within the radius. 

Inference for the Matern processes is computationally daunting. First and second order 
moments can be computed for Types I and II, but in those cases computing the likelihood 



of a set Y is seemingly intractable |M0ller et al., 2010]. Recent work byjHuber and Wolpert 



[2009 shows that it is possible to compute likelihood for certain restricted Type III processes, 
but computing moments cannot be done in closed form. In the general case, likelihood 
for Type III processes must be estimated using an expensive Markov chain Monte Carlo 
algorithm. 

The Matern processes are called "hard-core" because they strictly enforce a minimum 
radius between selected items. While this property leads to one kind of diversity, it is 
somewhat limited, and due to the procedural definition it is difficult to characterize the side 
effects of the thinning process in a general way. Stoyan and Stoyan^ [1985J considered an 
extension where the radius is itself chosen randomly, which may be more natural for certain 
settings, but it does not alleviate the computational issues. 



Random sequential adsorption The Matern repulsive processes are related in spirit 
to the random sequential adsorption (RSA) model, which has been used in physics and 
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chemistry to model particles that bind to two-dimensional surfaces, e.g., proteins on a cell 
membrane [Tanemurai 1979, Finegold and Donnell, 1979, Feder, 1980, Swendsen, 1981 



Hinrichsen et al., 1986| Ramsden, 1993 . RSA is described generatively as follows. Initially, 
the surface is empty; iteratively, particles arrive and bind uniformly at random to a location 
from among all locations that are not within a given radius of any previously bound particle. 
When no such locations remain (the "jamming limit"), the process is complete. 

Like the Matern processes, RSA is a hard-core model, designed primarily to capture 
packing distributions, with much of the theoretical analysis focused on the achievable density. 
If the set of locations is further restricted at each step to those found in an initially selected 



Poisson set then it is equivalent to a Matern Type III process [Huber and Wolpert , 2QQ9| ; 
it therefore shares the same computational burdens. 



2.6.2 Gibbs and Markov point processes 

While manipulating the Poisson process procedurally has some intuitive appeal, it seems 
plausible that a more holistically-defined process might be easier to work with, both 
analytically and algorithmically. The Gibbs point process provides such an approach, 
offering a general framework for incorporating correlations among selected items [Preston 



1976 




2004 





Ripley and Kelly[ [T9771 [Ripleyl [19911 |Van Lieshoutj [20001 |M0ller and Waagepetersen 



2007 



Daley and Vere- Jones 



2008 



The Gibbs probability of a set Y is given by 

V{Y = y) oc exp{-U{Y)) , (78) 

where U is an energy function. Of course, this definition is fully general without further 
constraints on U. A typical assumption is that U decomposes over subsets of items in Y; 
for instance 

exp(-C/(y))= n ^l^l(^) (79) 

for some small constant order k and potential functions ip. In practice, the most common 
case is A; = 2, which is sometimes called a pairwise interaction point process [Diggle et al. , 
[1987] : 

PiY = Y)^YlMi) n Mi J)- (80) 

In spatial settings, a Gibbs point process whose potential functions are identically 1 whenever 
their input arguments do not lie within a ball of fixed radius — that is, whose energy function 
can be decomposed into only local terms — is called a Markov point process. A number of 
specific Markov point processes have become well-known. 



Pairwise Markov processes Strauss [1975 ^ introduced a simple pairwise Markov point 
process for spatial data in which the potential function tp2{hj) is piecewise constant, taking 
the value 1 whenever i and j are at least a fixed radius apart, and the constant value 7 
otherwise. When 7 > 1, the resulting process prefers clustered items. (Note that 7 > 1 
is only possible in the discrete case; in the continuous setting the distribution becomes 
non-integrable.) We are more interested in the case < 7 < 1, where configurations in 
which selected items are near one another are discounted. When 7 = 0, the resulting process 
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becomes hard-core, but in general the Strauss process is "soft-core", preferring but not 
requiring diversity. 

The Strauss process is typical of pairwise Markov processes in that its potential function 
i^2ihj) = — j\) depends only on the distance between its arguments. A variety of 
alternative definitions for ^(•) have been proposed [Ripley and Kelly" , 1977, Ogata and 
Tanemura, 1984]. For instance, 



'0(r) = 1 — exp(— (r/cr)^) 
'^(r) = exp(-((j/r)'^), n> 2 
il;{r) = min(r/(j, 1) 



(81) 
(82) 
(83) 



where a controls the degree of repulsion in each case. Each definition leads to a point process 
with a slightly different concept of diversity. 



Area-interaction point processes Baddeley and Van Lieshout [1995] proposed a non- 
pairwise spatial Markov point process called the area-interaction model, where U(Y) is 
given by log 7 times the total area contained in the union of discs of fixed radius centered at 
all of the items in Y. When 7 > 1, we have log 7 > and the process prefers sets whose 
discs cover as little area as possible, i.e., whose items are clustered. When < 7 < 1, log 7 
becomes negative, so the process prefers "diverse" sets covering as much area as possible. 

If none of the selected items fall within twice the disc radius of each other, then 
ex.p( — U{Y)) can be decomposed into potential functions over single items, since the total 
area is simply the sum of the individual discs. Similarly, if each disc intersects with at most 
one other disc, the area-interaction process can be written as a pairwise interaction model. 
However, in general, an unbounded number of items might appear in a given disc; as a result 
the area-interaction process is an infinite-order Gibbs process. Since items only interact 
when they are near one another, however, local potential functions are sufficient and the 
process is Markov. 



Computational issues Markov point processes have many intuitive properties. In fact, it 
is not difficult to see that, for discrete ground sets 3^, the Markov point process is equivalent 
to a Markov random field (MRF) on binary variables corresponding to the elements of y. 
In Section |3.2.2 we will return to this equivalence in order to discuss the relative expressive 
possibilities of DPPs and MRFs. For now, however, we simply observe that, as for MRFs 
with negative correlations, repulsive Markov point processes are computationally intractable. 



Even computing the normalizing constant for Equation (78) is NP-hard in the cases outlined 



above [Daley and Vere- Jones , 2003, M0ller and Waagepetersen, 2004j . 



On the other hand, quite a bit of attention has been paid to approximate inference 



et al., 1982, Jensen and Moller, 1991, 


Ripley , 


1991 , Markov chain Monte Carlo methods 


Ripley and Kelly, 


1977 


, Besag and Green 


, 1993, 


Haggstrom et al. , 


1999 


, Berthelsen and 


M0ller, 2006 , and other approximations Ogata and Tanemura, 198£ 


), Diggle et al. 


, 1994 . 



Nonetheless, in general these methods are slow and/or inexact, and closed- form expressions 
for moments and densities rarely exist [M0ller and Waagepetersen , 2007j. In this sense the 
DPP is unique. 
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2.6.3 Generalizations of determinants 



The determinant of a A: x matrix K can be written as a polynomial of degree k in the 
entries of in particular, 



det(i^) = 5]sgn(7r)n^i,. 



(0 



(84) 



where the sum is over all permutations tt on 1, 2, . . . , fc, and sgn is the permutation sign 



function. In a DPP, of course, when K is (a submatrix of) the marginal kernel Equation (84) 
gives the appearance probability of the k items indexing K. A natural question is whether 
generalizations of this formula give rise to alternative point processes of interest. 



Immanantal point processes In fact. Equation (84) is a special case of the more general 



matrix immanant, where the sgn function is replaced by x? the irreducible representation- 
theoretic character of the symmetric group on k items corresponding to a particular partition 
of 1, 2, . . . , fc. When the partition has k parts, that is, each element is in its own part, 
x(7r) = sgn(7r) and we recover the determinant. When the partition has a single part, 
x(7r) = 1 and the result is the permanent of K. The associated permanental process was 
first described alongside DPPs by Macchi [1975| , who referred to it as the "boson process." 
Bosons do not obey the Pauli exclusion principle, and the permanental process is in some 
ways the opposite of a DPP, preferring sets of points that are more tightly clustered, or less 
diverse, than if they were independent. Several recent papers have considered its properties 



in some detail [Hough et al. , 2006, McCullagh and M0ller, 2006 1. Furthermore, Diaconis 



and Evans [2000 1 considered the point processes induced by general immanants, showing 
that they are well defined and in some sense "interpolate" between determinantal and 
permanental processes. 

Computationally, obtaining the permanent of a matrix is #P-complete [Valiant , 1979 



making the permanental process difficult to work with in practice. Complexity results for 
immanants are less definitive, with certain classes of immanants apparently hard to compute 
iBiirgisser, 2000] [Brylinski and Brylinskil |2003[ , while some upper bounds on complexity 
are known 



Hartmann 



1985 



computable Grone and Merris, 1984 



Barvinok[ "l990 , and at least one non-trivial case is efficiently 
It is not clear whether the latter result provides 



enough leverage to perform inference beyond computing marginals. 



Q;-determinantal point processes An alternative generalization of Equation ( |84| ) is given 
by the so-called ^-determinant, where sgn(7r) is replaced by a^~^^^\ with z/(7r) counting the 
number of cycles in tt [Vere- Jones , 1997, [Hough et al. , 2Q06| . When a = —1 the determinant 
is recovered, and when a = +1 we have again the permanent. Relatively little is known 
for other values of a, although [Shirai and Takahashi [2003a| conjecture that the associated 
process exists when < a < 2 but not when a > 2. Whether a-determinantal processes 
have useful properties for modeling or computational advantages remains an open question. 



Hyperdeterminantal point processes A third possible generalization of Equation (84) 



is the hyperdeterminant originally proposed by | Cay ley [1843 and discussed in the context of 
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point processes by Evans and Gottlieb 2009 . Whereas the standard determinant operates 



on a two-dimensional matrix with entries indexed by pairs of items, the hyperdeterminant 
operates on higher-dimensional kernel matrices indexed by sets of items. The hyperdetermi- 
nant potentially offers additional modeling power, and Evans and Gottlieb |2009| show that 
some useful properties of DPPs are preserved in this setting. However, so far relatively little 
is known about these processes. 



2.6.4 Quasirandom processes 



Monte Carlo methods rely on draws of random points in order to approximate quantities of 
interest; randomness guarantees that, regardless of the function being studied, the estimates 
will be accurate in expectation and converge in the limit. However, in practice we get to 
observe only a finite set of values drawn from the random source. If, by chance, this set 
is "bad", the resulting estimate may be poor. This concern has led to the development 
of so-called quasirandom sets, which are in fact deterministically generated, but can be 
substituted for random sets in some instances to obtain improved convergence guarantees 
[NiederreiteTl |1992[ |Sobol[ |1998| . 

In contrast with pseudorandom generators, which attempt to mimic randomness by 
satisfying statistical tests that ensure unpredictability, quasirandom sets are not designed to 
appear random, and their elements are not (even approximately) independent. Instead, they 
are designed to have low discrepancy; roughly speaking, low-discrepancy sets are "diverse" in 
that they cover the sample space evenly. Consider a finite subset Y of [0, 1]^, with elements 



X 



for z = 1, 2, . . . , A:. Let Sx = [0, xi) x [0, X2) 



[0, xd) denote 



the box defined by the origin and the point x. The discrepancy of Y is defined as follows. 



disc(y) — max 



\Yns, 



k 



Vol(5, 



(85) 



That is, the discrepancy measures how the empirical density \Y H Sx\/k differs from the 
uniform density Yo\{Sx) over the boxes Sx- Quasirandom sets with low discrepancy cover the 
unit cube with more uniform density than do pseudorandom sets, analogously to Figure [Ij 
This deterministic uniformity property makes quasirandom sets useful for Monte Carlo 
estimation via (among other results) the Koksma-Hlawka inequality [Hlawka , 1961, Nieder-| 
1992|. For a function / with bounded variation V{f) on the unit cube, the inequality 



reiter 



states that 



fix. 



-I 



[0,1]^ 



f{x)dx 



< V{f)disc{Y) . 



(86) 



Thus, low-discrepancy sets lead to accurate quasi-Monte Carlo estimates. In contrast to 
typical Monte Carlo guarantees, the Koksma-Hlawka inequality is deterministic. Moreover, 
since the rate of convergence for standard stochastic Monte Carlo methods is this 
result is an (asymptotic) improvement when the discrepancy diminishes faster than 

In fact, it is possible to construct quasirandom sequences where the discrepancy of the 
first k elements is 0((log A:)^/A:); the first such sequence was proposed by Hal ton] |l960| . 
The Sobol sequence |Sobol[ [1967 , introduced later, offers improved uniformity properties 



and can be generated efficiently 



Brat ley and Fox, 1988 
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It seems plausible that, due to their uniformity characteristics, low-discrepancy sets 
could be used as computationally efficient but non-probabilistic tools for working with 
data exhibiting diversity. An algorithm generating quasirandom sets could be seen as an 
efficient prediction procedure if made to depend somehow on input data and a set of learned 
parameters. However, to our knowledge no work has yet addressed this possibility. 



3 Representation and algorithms 

Determinantal point processes come with a deep and beautiful theory, and, as we have seen, 
exactly characterize many theoretical processes. However, they are also promising models 
for real-world data that exhibit diversity, and we are interested in making such applications 
as intuitive, practical, and computationally efficient as possible. In this section, we present a 
variety of fundamental techniques and algorithms that serve these goals and form the basis 
of the extensions we discuss later. 

We begin by describing a decomposition of the DPP kernel that offers an intuitive 
tradeoff between a unary model of quality over the items in the ground set and a global 
model of diversity. The geometric intuitions from Section [2] extend naturally to this 
decomposition. Splitting the model into quality and diversity components then allows us to 
make a comparative study of expressiveness — that is, the range of distributions that the 
model can describe. We discuss the expressive power of a DPP and compare it to that of 
pairwise Markov random fields with negative interactions, showing that the two models 
are incomparable in general but exhibit qualitatively similar characteristics, despite the 
computational advantages offered by DPPs. 

Next, we turn to the challenges imposed by large data sets, which are common in practice. 
We ffist address the case where the number of items in the ground set, is very large. 
In this setting, the super-linear number of operations required for most DPP inference 
algorithms can be prohibitively expensive. However, by introducing a dual representation of 
a DPP we show that efficient DPP inference remains possible when the kernel is low-rank. 
When the kernel is not low-rank, we prove that a simple approximation based on random 
projections dramatically speeds inference while guaranteeing that the deviation from the 
original distribution is bounded. These techniques will be especially useful in Section [6} 
when we consider exponentially large N. 

Finally, we discuss some alternative formulas for the likelihood of a set Y in terms of the 



marginal kernel K. Compared to the L-ensemble formula in Equation (13), these may be 



analytically more convenient, since they do not involve ratios or arbitrary principal minors. 



3.1 Quality vs. diversity 

An important practical concern for modeling is understandability; that is, practitioners 
should be able to interpret the parameters of the model in an intuitive way. While the 
entries of the DPP kernel are not totally opaque in that they can be seen as measures 
of similarity — reflecting our primary qualitative characterization of DPPs as diversifying 
processes — in most practical situations we want diversity to be balanced against some 
underlying preferences for different items in y. In this section, we propose a decomposition 
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of the DPP that more directly illustrates the tension between diversity and a per-item 
measure of quality. 

In Section [2] we observed that the DPP kernel L can be written as a Gram matrix, 
L — where the columns of B are vectors representing items in the set 3^. We 

now take this one step further, writing each column Bi as the product of a quality term 
qi G and a vector of normalized diversity features (pi G M^, \\(f)i\\ = 1. (While D = N is 
sufficient to decompose any DPP, we keep D arbitrary since in practice we may wish to use 
high-dimensional feature vectors.) The entries of the kernel can now be written as 

Lij = qi(pl(pjqj . (87) 

We can think of qi G as measuring the intrinsic "goodness" of an item z, and (pj (pj G [—1, 1] 
as a signed measure of similarity between items i and j. We use the following shorthand for 
similarity: 

S^J = ^Z-/*. = ^== • (88) 

This decomposition of L has two main advantages. First, it implicitly enforces the 
constraint that L must be positive semidefinite, which can potentially simplify learning (see 
Section [4]). Second, it allows us to independently model quality and diversity, and then 
combine them into a unified model. In particular, we have: 

Vl{Y)^ (n^')det(5y), (89) 

where the first term increases with the quality of the selected items, and the second term 
increases with the diversity of the selected items. We will refer to q as the quality model and 
5 or as the diversity model. Without the diversity model, we would choose high-quality 
items, but we would tend to choose similar high-quality items over and over. Without 
the quality model, we would get a very diverse set, but we might fail to include the most 
important items in focusing instead on low-quality outliers. By combining the two models 
we can achieve a more balanced result. 

Returning to the geometric intuitions from Section |2.2.H the determinant of Ly is equal 
to the squared volume of the parallelepiped spanned by the vectors qi(j)i for i . The 
magnitude of the vector representing item i is qi^ and its direction is (pi. Figure [9] (reproduced 
from the previous section) now makes clear how DPPs decomposed in this way naturally 
balance the two objectives of high quality and high diversity. Going forward, we will nearly 
always assume that our models are decomposed into quality and diversity components. This 
provides us not only with a natural and intuitive setup for real- world applications, but also 
a useful perspective for comparing DPPs with existing models, which we turn to next. 

3 . 2 Expressiveness 

Many probabilistic models are known and widely used within the machine learning community. 
A natural question, therefore, is what advantages DPPs offer that standard models do 
not. We have seen already how a large variety of inference tasks, like sampling and 
conditioning, can be performed efficiently for DPPs; however efficiency is essentially a 
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(b) (c) 

Figure 9: Revisiting DPP geometry: (a) The probability of a subset Y is the square of the 
volume spanned by for i ^Y. (b) As item z's quality qi increases, so do the probabilities 
of sets containing item i. (c) As two items i and j become more similar, increases and 

the probabilities of sets containing both i and j decrease. 



prerequisite for any practical model. What makes DPPs particularly unique is the marriage 
of these computational advantages with the ability to express global, negative interactions 
between modeling variables; this repulsive domain is notoriously intractable using traditional 



approaches like graphical models Murphy et al. 
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[Taskar et al.[|2004[|Yanover and Weiss} |2002[ [Yanover et al.[|2006[|Kulesza and Pereira 



Boros and Hammer 



2002 



Ishikawa 



In this section we elaborate on the expressive powers of DPPs and compare them 



with those of Markov random fields, which we take as representative graphical models. 



3.2.1 Markov random fields 

A Markov random field (MRF) is an undirected graphical model defined by a graph G whose 
nodes 1, 2, . . . , represent random variables. For our purposes, we will consider binary 
MRFs, where each output variable takes a value from {0, 1}. We use yi to denote a value of 
the ith output variable, bold to denote an assignment to a set of variables c, and y for 
an assignment to all of the output variables. The graph edges E encode direct dependence 
relationships between variables; for example, there might be edges between similar elements 
i and j to represent the fact that they tend not to co-occur. MRFs are often referred to 
as conditional random fields when they are parameterized to depend on input data, and 



especially when G is a chain Lafferty et al., 2001 



An MRF defines a joint probability distribution over the output variables that factorizes 
across the cliques C of G: 

'P{y) = ^U'^c{yc)- (90) 



Here each '0c is a potential function that assigns a nonnegative value to every possible 
assignment y^ of the clique c, and Z is the normalization constant ^ / HcgC i^ciVc)- Note 
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that, for a binary MRF, we can think of y as the characteristic vector for a subset Y of 
3^ = {1, 2, ... , N}. Then the MRF is equivalently the distribution of a random subset 1", 
where ViY — Y) is equivalent to V{y). 



The Hammersley-Chfford theorem states that V{y) defined in Equation (90) is always 
Markov with respect to G; that is, each variable is conditionally independent of all other 
variables given its neighbors in G. The converse also holds: any distribution that is Markov 
with respect to G, as long as it is strictly positive, can be decomposed over the cliques of G as 
in Equation (90) [Grimmett, 1973 . MRFs therefore offer an intuitive way to model problem 



structure. Given domain knowledge about the nature of the ways in which outputs interact, 
a practitioner can construct a graph that encodes a precise set of conditional independence 
relations. (Because the number of unique assignments to a clique c is exponential in |c|, 
computational constraints generally limit us to small cliques.) 

For comparison with DPPs, we will focus on pairwise MRFs, where the largest cliques 
with interesting potential functions are the edges; that is, ^l^ciVc) — 1 cliques c where 

|c| > 2. The pairwise distribution is 



1 ^ 



(91) 



ijeE 



We refer to il^i{yi) as node potentials, and ipijiVi^Vj) as edge potentials. 

MRFs are very general models — in fact, if the cliques are unbounded in size, they are 
fully general — but inference is only tractable in certain special cases. [Cooper [1990 showed 
that general probabilistic inference (conditioning and marginalization) in MRFs is NP-hard, 
and this was later extended by Dagum and Luby [l993j , who showed that inference is 
NP-hard even to approximate. Shimony 1994 proved that the MAP inference problem 



(finding the mode of an MRF) is also NP-hard, and Abdelbar and Hedetniemi [1998 showed 
that the MAP problem is likewise hard to approximate. In contrast, we showed in Section |2] 
that DPPs offer efficient exact probabilistic inference; furthermore, although the MAP 
problem for DPPs is NP-hard, it can be approximated to a constant factor under cardinality 
constraints in polynomial time. 

The first tractable subclass of MRFs was identified by Pearl [1982] , who showed that 
belief propagation can be used to perform inference in polynomial time when G is a tree. 
More recently, certain types of inference in binary MRFs with associative (or submodular) 
potentials have been shown to be tractable [Boros and Hamroer , 2002 , [Taskar et aL| [2004 



Kolmogorov and Zabih, 2004]. Inference in non-binary associative MRFs is NP-hard, but 
can be efficiently approximated to a constant factor depending on the size of the largest 
clique [Taskar et al. , 2004 . Intuitively, an edge potential is called associative if it encourages 
the endpoint nodes take the same value (e.g., to be both in or both out of the set Y). More 
formally, associative potentials are at least one whenever the variables they depend on are all 



equal, and exactly one otherwise. We can rewrite the pairwise, binary MRF of Equation (91) 
in a canonical log-linear form: 



V{y) oc exp 



ijeE 



(92) 
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Here we have eliminated redundancies by forcing tpi{0) = 1, '0ij(O, 0) = '0ij(0, 1) = 0) = 

1, and setting Wi = log '0^(1), Wij = log'0^j(l, 1). This parameterization is sometimes called 
the fully visible Boltzmann machine. Under this representation, the MRF is associative 
whenever Wij > for all ij G E. 

We have seen that inference in MRFs is tractable when we restrict the graph to a 
tree or require the potentials to encourage agreement. However, the repulsive potentials 
necessary to build MRFs exhibiting diversity are the opposite of associative potentials (since 
they imply Wij < 0), and lead to intractable inference for general graphs. Indeed, such 
negative potentials can create "frustrated cycles" , which have been used both as illustrations 



of common MRF inference algorithm failures Kulesza and Pereira, 2008| and as targets 
for improving those algorithms ^Sontag and Jaakkola , 2007]. A wide array of (informally) 



approximate inference algorithms have been proposed to mitigate tract ability problems, but 
none to our knowledge effectively and reliably handles the case where potentials exhibit 
strong repulsion. 

3.2.2 Comparing DPPs and MRFs 

Despite the computational issues outlined in the previous section, MRFs are popular models 
and, importantly, intuitive for practitioners, both because they are familiar and because 
their potential functions directly model simple, local relationships. We argue that DPPs 
have a similarly intuitive interpretation using the decomposition in Section jSTTj Here, we 
compare the distributions realizable by DPPs and MRFs to see whether the tractability of 
DPPs comes at a large expressive cost. 

Consider a DPP over 3^ = {1, 2, ... , N} with N x N kernel matrix L decomposed as in 
Section 13.11 we have 



Vl{Y) oc det(Ly) = I n • (93) 

XieY J 

The most closely related MRF is a pairwise, complete graph on N binary nodes with negative 
interaction terms. We let yi = I{i E Y) be indicator variables for the set y, and write the 



MRF in the log-linear form of Equation (92): 



'Pmrf(^) oc exp ^Wiyi ^^Wijyiyj , (94) 
\ ^ / 

where Wij < 0. 

Both of these models can capture negative correlations between indicator variables y^. 
Both models also have parameters: the DPP has quality scores qi and similarity 

measures Sij^ and the MRF has node log-potentials Wi and edge log-potentials Wij. The key 
representational difference is that, while Wij are individually constrained to be nonpositive, 
the positive semidefinite constraint on the DPP kernel is global. One consequence is that, 
as a side effect, the MRF can actually capture certain limited positive correlations; for 
example, a 3-node MRF with wu^wis < and W23 = induces a positive correlation 
between nodes two and three by virtue of their mutual disagreement with node one. As 
we have seen in Section [2| the semidefinite constraint prevents the DPP from forming any 
positive correlations. 
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Figure 10: A factor graph representation of a 3-item MRF or DPP. 



More generally, semidefiniteness means that the DPP diversity feature vectors must 
satisfy the triangle inequality, leading to 

VT^^ + v/T^^ > v/i - Sik (95) 

for all i, j. A: G y since ||(/)^ — || oc y^l — Sij. The similarity measure therefore obeys a type 
of transitivity, with large Sij and Sjk implying large Sik- 



Equation (95) is not, by itself, sufficient to guarantee that L is positive semidefinite, 
since S must also be realizable using unit length feature vectors. However, rather than 
trying to develop further intuition algebraically, we turn to visualization. While it is difficult 
to depict the feasible distributions of DPPs and MRFs in high dimensions, we can get a feel 
for their differences even with a small number of elements N. 

When = 2, it is easy to show that the two models are equivalent, in the sense that 
they can both represent any distribution with negative correlations: 

Viyi = l)V{y2 = 1) > V{yi = 1, y2 = 1) • (96) 

When = 3, differences start to become apparent. In this setting both models have 
six parameters: for the DPP they are (gi, ^2, ^3, 5^12, S'la, 523), and for the MRF they are 
{wi^W2^W'^^wi2^wi'^^W2'^). To place the two models on equal footing, we represent each 
as the product of unnormalized per-item potentials '0i,'02 5'03 and a single unnormalized 
ternary potential ^123- This representation corresponds to a factor graph with three nodes 
and a single, ternary factor (see Figure [To|. The probability of a set Y is then given by 



V{Y) oc '^Ai (1/1)^2 (1/2) '03 (1/3) '0123 (l/i, 1/2, yz) • (97) 

For the DPP, the node potentials are 0f ^^(y^) = q^^', and for the MRF they are il^f^^iyi) = 
e^^y\ The ternary factors are 

V'?23''(yi,y2,y3) = det(5y), (98) 
^123 ^(yi. 1/2,^3) = exp l^^ WijViy^ . (99) 

Since both models allow setting the node potentials arbitrarily, we focus now on the 
ternary factor. Table [l] shows the values of V'm^ V'm ^ for all subsets Y <^y. The 
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Table 1: Values of ternary factors for 3- item MRFs and DPPs. 



last four entries are determined, respectively, by the three edge parameters of the MRF and 
three similarity parameters Sij of the DPP, so the sets of realizable ternary factors form 3-D 
manifolds in 4-D space. We attempt to visualize these manifolds by showing 2-D slices in 
3-D space for various values of '^123(1, 1, 1) (the last row of Table [1]). 

Figure 11 a| depicts four such slices of the realizable DPP distributions, and Figure [TTb 



shows the same slices of the realizable MRF distributions. Points closer to the origin (on 
the lower left) correspond to "more repulsive" distributions, where the three elements of 
y are less likely to appear together. When ^123(1, 1, 1) is large (gray surfaces), negative 
correlations are weak and the two models give rise to qualitatively similar distributions. As 
the value of the '^123(1, 1,1) shrinks to zero (red surfaces), the two models become quite 
different. MRFs, for example, can describe distributions where the first item is strongly 
anti-correlated with both of the others, but the second and third are not anti-correlated 
with each other. The transitive nature of the DPP makes this impossible. 

To improve visibility, we have constrained Su, Sis, S23 > in Figure [TTa , Figure 11c 



shows a single slice without this constraint; allowing negative similarity makes it possible 
to achieve strong three-way repulsion with less pairwise repulsion, closing the surface away 
from the origin. The corresponding MRF slice is shown in Figure |lld[ and the two are 
overlaid in Figure |lle| and Figure |llf[ Even though there are relatively strong interactions 
in these plots ('^123(1, 1,1) = 0.1), the models remain roughly comparable in terms of the 
distributions they can express. 

As N gets larger, we conjecture that the story is essentially the same. DPPs are primarily 
constrained by a notion of transitivity on the similarity measure; thus it would be difficult 
to use a DPP to model, for example, data where items repel "distant" items rather than 
similar items — if i is far from j and j is far from k we cannot necessarily conclude that i is 
far from k. One way of looking at this is that repulsion of distant items induces positive 
correlations between the selected items, which a DPP cannot represent. 

MRFs, on the other hand, are constrained by their local nature and do not effectively 
model data that are "globally" diverse. For instance, a pairwise MRF we cannot exclude a 
set of three or more items without excluding some pair of those items. More generally, an 
MRF assumes that repulsion does not depend on (too much) context, so it cannot express 
that, say, there can be only a certain number of selected items overall. The DPP can 
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(a) DPP slices (partial) 



1 V 
(b) MRF slices 




(c) DPP slice (full) for V^i23(l, 1, 1) = 0.1 (d) MRF slice for V^i23(l, 1, 1) = 0.1 




(e) Overlay (f) Overlay (rotated) 



Figure 11: (a,b) Realizable values of '^123(1, 1,0), '^123(1,0, 1), and '^123(0, 1, 1) in a 3-factor 
when '0123(1,1,1) = 0.001 (red), 0.25 (green), 0.5 (blue), and 0.75 (grey). (c,d) Surfaces 
for 0123(1, 1, 1) = 0.1, allowing negative similarity for the DPP. (e,f) DPP (blue) and MRF 
(red) surfaces superimposed. 
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naturally implement this kind of restriction though the rank of the kernel. 



3.3 Dual representation 



The standard inference algorithms for DPPs rely on manipulating the kernel L through 
inversion, eigendecomposition, and so on. However, in situations where N is large we may 
not be able to work efficiently with L — in some cases we may not even have the memory 
to write it down. In this section, instead, we develop a dual representation of a DPP that 
shares many important properties with the kernel L but is often much smaller. Afterwards, 
we will show how this dual representation can be applied to perform efficient inference. 

Let B be the D x N matrix whose columns are given by Bi = qi4>i^ so that L = B^ B. 
Consider now the matrix 



C = BB^ 



(100) 



By construction, C is symmetric and positive semidefinite. In contrast to L, which is too 
expensive to work with when N is large, C is only D x where D is the dimension of the 
diversity feature function 0. In many practical situations, D is fixed by the model designer, 
while N may grow without bound as new items become available; for instance, a search 
engine may continually add to its database of links. Furthermore, we have the following 
result. 

Proposition 3.1. The nonzero eigenvalues of C and L are identical, and the corresponding 
eigenvectors are related by the matrix B. That is, 



C 



D 
n=l 



is an eigendecomposition of C if and only if 

D 



(101) 



(102) 



is an eigendecomposition of L. 

Proof. In the forward direction, we assume that {{Xn^Vn)}^^^ is an eigendecomposition of 
C. We have 



J2 An (^-^B^Vn^ (^J-B^v^^ = B'^ v^vl^ B 

= B^B = L, 

since Vn are orthonormal by assumption. Furthermore, for any n we have 

2 1 



^ -B-^Vr, 



= —{B^VnViB^Vn) 



= —\n\\Vn\ 
Ar7, 



(103) 
(104) 

(105) 

(106) 

(107) 
(108) 
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using the fact that Cvn — ^ni^n since Vn is an eigenvector of C. FinaUy, for any distinct 
1 < a, 6 < we have 

^ B'^va] (4^B'^Vb) = -j^vJCvb (109) 



^'-vjvb (110) 



0. (Ill) 



Thus I ^A^, I is an eigendecomposition of L. In the other direction, an 

analogous argument apphes once we observe that, since L = L has rank at most D 

and therefore at most D nonzero eigenvalues. □ 



Proposition |3.1| shows that C contains quite a bit of information about L. In fact, C is 
sufficient to perform nearly all forms of DPP inference efficiently, including normalization 
and marginalization in constant time with respect to and sampling in time linear in N. 



3.3.1 Normalization 

Recall that the normalization constant for a DPP is given by det(L + /). If Ai, A2, . . . , Aat 
are the eigenvalues of L, then the normalization constant is equal to 11^=1 (-^ri + 1), since 
the determinant is the product of the eigenvalues of its argument. By Proposition |3.1| the 
nonzero eigenvalues of L are also the eigenvalues of the dual representation C. Thus, we 
have 

D 

det(L + /) = J] (A^ + 1) = det(C + /) . (112) 

n=l 

Computing the determinant of C + / requires 0{D^) time. 



3.3.2 Marginalization 

Standard DPP marginalization makes use of the marginal kernel which is of course as 
large as L. However, the dual representation C can be used to compute the entries of K. We 
first eigendecompose the dual representation as C = Yln=i ^n^^ni^n^ which requires 0{D^) 
time. Then, we can use the definition of K in terms of the eigendecomposition of L as well 



as Proposition |3. 1| to compute 

Ku = j2Y^ABjK? (113) 

n=l 

D 



n=l 



That is, the diagonal entries of K are computable from the dot products between the diversity 
features and the eigenvectors of C; we can therefore compute the marginal probability of 
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Algorithm 2 Sampling from a DPP 



Input: eigendecomposition {{vn, )^n)}n=i ^ 

forn = 1,2, . . . ,7V do 

J ^ J U {n} with prob. 
end for 

V ^ {VnjneJ 

while \V\ > do 

Select i from y with Pr(z) = J2vevi^^ ^i)'^ 

Y ^YUi 

V ^ an orthonormal basis for the subspace of V orthogonal to 
end while 

Output: Y 



a single item i ^ y from an eigendecomposition of C in 0{D^) time. Similarly, given two 
items i and j we have 



^ A 



= E ^-^(i?7*n)(i?7^)n) (115) 
n=l '^'^ + 

= lilj Yl \ lA 'l'J^n){4'Jvn) , (116) 

72=1 

so we can compute arbitrary entries of K in 0{D^) time. This allows us to compute, for 
example, pairwise marginals V{i,j E 1^) = KuKjj — Kfj. More generally, for a set A G 3^, 

I A I = A:, we need to compute ^^^^^"^ entries of K to obtain Ka, and taking the determinant 
then yields V{A C Y). The process requires only 0{D^k^ + k^) time; for small sets \A\ this 
is just quadratic in the dimension of (j). 

3.3.3 Sampling 

Recall the DPP sampling algorithm, which is reproduced for convenience in Algorithm [2j 
We will show that this algorithm can be implemented in a tractable manner by using the 
dual representation C. The main idea is to represent the orthonormal set of vectors in 
as a set V of vectors in R^, with the mapping 



V 



I^B^v I i) G T>} . (117) 



Note that, when V contains eigenvectors of C, this is (up to scale) the relationship established 
by Proposition |3. 1| between eigenvectors v of C and eigenvectors v of L. 



The mapping in Equation (117) has several useful properties. If vi — B^vi and 
V2 — V2-, then vi-\- V2 = (vi + '62)5 cind likewise for any arbitrary linear combination. 
In other words, we can perform implicit scaling and addition of the vectors in V using only 
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their preimages in V. Additionally, we have 

vjv2 = {B^viy{B^V2) (118) 
= vJCv2. (119) 

so we can compute dot products of vectors in V in 0{D'^) time. This means that, for 
instance, we can implicitly normalize v — B^v by updating v ^ ^t'cv ' 

We now show how these operations allow us to efficiently implement key parts of the 
sampling algorithm. Because the nonzero eigenvalues of L and C are equal, the first loop of 
the algorithm, where we choose in index set J, remains unchanged. Rather than using J to 
construct orthonormal V directly, however, we instead build the set V by adding ^^I!;^ for 
every n ^ J. 

In the last phase of the loop, we need to find an orthonormal basis V± for the subspace 
of V orthogonal to a given e^. This requires two steps. In the first, we subtract a multiple 
of one of the vectors in V from all of the other vectors so that they are zero in the ith 
component, leaving us with a set of vectors spanning the subspace of V orthogonal to e^. In 
order to do this we must be able to compute the ith component of a vector v G V; since 
V = B^v^ this is easily done by computing the zth column of and then taking the dot 
product with v. This takes only 0{D) time. In the second step, we use the Gram-Schmidt 
process to convert the resulting vectors into an orthonormal set. This requires a series of 
dot products, sums, and scalings of vectors in V., however, as previously argued all of these 



operations can be performed implicitly. Therefore the mapping in Equation (117) allows us 
to implement the final line of the second loop using only tractable computations on vectors 
in V. 

All that remains, then, is to efficiently choose an item i according to the distribution 

Pr(^) = r^E(^^^^)' (120) 
= ^E((^^*)^^^)' (121) 

in the first line of the while loop. Simplifying, we have 

' ' vev 

Thus the required distribution can be computed in time 0{NDk), where k — \V\. The 
complete dual sampling algorithm is given in Algorithm [sj the overall runtime is 0{NDk^ + 
D^k^). 



3.4 Random projections 

As we have seen, dual DPPs allow us to deal with settings where N is too large to work 
efficiently with L by shifting the computational focus to the dual kernel C, which is only 
D X D. This is an effective approach when D <^ N. Of course, in some cases D might also 
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Algorithm 3 Sampling from a DPP (dual representation) 



Input: eigendecomposition {{vm ^n)}n=i of ^ 

for n = 1,2,..., AT do 

J ^ JU{n} with prob. 
end for 

[v Cv ) 

while \V\ > do 

Select i from 3^ with Pr(i) = ^ E«ey(^>^^i)' 

Let Vq be a vector in V with -Bj^Vo 7^ 
Update V ^ \v- ^^vq \veV - {vq}} 

I ^ Vq Bi ) 

Orthonormalize V with respect to the dot product {vi^V2) — v1Cv2 
end while 
Output: Y 



be unmanageably large, for instance when the diversity features are given by word counts in 
natural language settings, or high-resolution image features in vision. 

To address this problem, we describe a method for reducing the dimension of the diversity 
features while maintaining a close approximation to the original DPP model. Our approach 
is based on applying random projections, an extremely simple technique that nonetheless 
provides an array of theoretical guarantees, particularly with respect to preserving distances 
between points [Vempala , 2004] . A classic result of Johnson and Lindenstrauss [1984], for 



instance, shows that high-dimensional points can be randomly projected onto a logarithmic 
number of dimensions while approximately preserving the distances between them. More 
recently, Magen and Zouzias |2Q08| extended this idea to the preservation of volumes spanned 



by sets of points. Here, we apply the connection between DPPs and spanned volumes to 
show that random projections allow us to reduce the dimensionality of 0, dramatically 
speeding up inference, while maintaining a provably close approximation to the original, 
high-dimensional model. We begin by stating a variant of Magen and Zouzias' result. 



Lemma 3.2. (Adapted from Magen and Zouzias 1200^ ) Let X be a D x N matrix. Fix 



k < N and < e, 5 < 1/2^ and set the projection dimension 

2k 24 flog{3/6) 
log TV 



= max ^ — , - ( -^y^ + 1 ) (log TV + 1) + A: - 1 J> . (123) 



Let G be a d X D random projection matrix whose entries are independently sampled from 
A/'(0, and let Xy, where y C {1, 2, . . . , N}, denote the D x \Y\ matrix formed by taking 
the columns of X corresponding to indices in Y . Then with probability at least 1 — 5 we 
have, for all Y with cardinality at most k: 
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where Vol(Xy) is the k- dimensional volume of the parallelepiped spanned by the columns of 
Xy. 



Lemma [3^2] says that, with high probabihty, randomly projecting to 

d = 0(max{A:/e, (log(l/5) + log N)/e^}) (125) 

dimensions is sufficient to approximately preserve all volumes spanned by k columns of X. 
We can use this result to bound the effectiveness of random projections for DPPs. 

In order to obtain a result that is independent of we will restrict ourselves to the 
portion of the distribution pertaining to subsets Y with cardinality at most a constant k. 
This restriction is intuitively reasonable for any application where we use DPPs to model 
sets of relatively small size compared to which is common in practice. However, formally 
it may seem a bit strange, since it implies conditioning the DPP on cardinality. In Section |5] 
we will show that this kind of conditioning is actually very practical and efficient, and 
Theorem |3.3| which we prove shortly, will apply directly to the fc-DPPs of Section |5] without 



any additional work. 

For now, we will seek to approximate the distribution V-^{Y) = V{Y — Y \ \Y\ < A:), 
which is simply the original DPP conditioned on the cardinality of the modeled subset: 

r<kiy^ = (n.eygDdet(</>(r)^<^(y)) 

^ ' E|y'|<.(n.ey9.')det(0(y)T./>(y))' ^ ^ 

where (j){Y) denotes the D x \Y\ matrix formed from columns (f)i for i ^Y. Our main result 
follows. 

Theorem 3.3. Let V-^{Y) he the cardinality- conditioned DPP distribution defined by quality 
model q and D-dimensional diversity feature function (j), and let 

V^\Y) c ( n qi] det([G,/.(y)]T[G0(y)]) (127) 
\ieY J 

be the cardinality- conditioned DPP distribution obtained by projecting (j) with G. Then for 



projection dimension d as in Equation (123), we have 

\\V-^ -V-^\\i <e^^' -I (128) 

with probability at least 1 — 5. Note that e^^^ — 1 ^ Qke when ke is small. 

The theorem says that for d logarithmic in N and linear in A:, the Li variational distance 
between the original DPP and the randomly projected version is bounded. In order to use 



Lemma 3.2, which bounds volumes of parallelepipeds, to prove this bound on determinants. 



we will make use of the following relationship: 



Vol(Xy) = JdetiX^Xy) . (129) 



In order to handle the conditional DPP normalization constant 



j2 iii<in<i^^(<t>{YV<t^(^))^ (130) 

\Y\<k \ieY I 
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we also must adapt Lemma 3^ to sums of determinants. Finally, for technical reasons we 
will change the symmetry of the upper and lower bounds from the sign of e to the sign of 
the exponent. The following lemma gives the details. 



Lemma 3.4. Under the definitions and assumptions of Lemma \3. 2\ we have, with probability 
at least 1 — 6, 



(l + 2e) 



-2k 



< 



E|y|<fcdet((G'Xy)T(GXy)) 



E|y|<fcdet(XTXy) 



< (1 + e 



,2/c 



(131) 



Proof. 

det((GXy)^(GXy)) = J2 Vol2(GXy) (132) 

\Y\<k \Y\<k 

> J2 (V0l(Xy)(l- 6)1^1)' (133) 

\Y\<k 

> (1-6)2^ J2 Vol2(Xy) (134) 

\Y\<k 

> (1 + 2e)-2^ Yl det(XjXy) , (135) 

\Y\<k 

where the first inequality holds with probability at least 1 — 5 by Lemma |3.2| and the third 
follows from the fact that (1 - e)(l + 2e) > 1 (since e < 1/2), thus (1 - e)^^ > (1 + 2e)-^^. 
The upper bound follows directly: 

(Vol(GXy))2< J2 (vol(Xy)(l + e)l^l)' (136) 

\Y\<k \Y\<k 

< (1 + ef" J2 det(X^Xy) . (137) 

|y|<A: 

□ 



We can now prove Theorem 3.3 



Proof of Theorem\3.3\ Recall the matrix B, whose columns are given by Bi — qi4>i- We 
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have 



|p<fc_p<fc||^^ \P<k^Y)-P^''{Y)\ (138) 
\Y\<k 

P<k^Y) 



|y|<A: 



1 - 



\Y\<k 



P<k{Y) 

det([GSj][G5y]) E|y/|<fcdet(S^,Sy/ 



(139) 



det(fijfiy) E|y/|<fcdet([GSj,][GSy/]) 

< |l - (1 + e)2^(l + 2ef^\ P-^Y) (140) 

|y|<fc 

< e^^' - 1 , (141) 



where the first inequahty fohows from Lemma 3.2 and Lemma [3. 4| which hold simultaneously 
with probability at least 1 — 5, and the second follows from (1 + a)^ < e^^ for a, 6 > 0. □ 

By combining the dual representation with random projections, we can deal simultane- 
ously with very large and very large D. In fact, in Section [6] we will show that N can 
even be exponentially large if certain structural assumptions are met. These techniques 
vastly expand the range of problems to which DPPs can be practically applied. 

3.5 Alternative likelihood formulas 

Recall that, in an L-ensemble DPP, the likelihood of a particular set y C is given by 

This expression has some nice intuitive properties in terms of volumes, and, ignoring the 
normalization in the denominator, takes a simple and concise form. However, as a ratio of 
determinants on matrices of differing dimension, it may not always be analytically convenient. 
Minors can be difficult to reason about directly, and ratios complicate calculations like 
derivatives. Moreover, we might want the likelihood in terms of the marginal kernel 
K = L{L + — I — [L ^ but simply plugging in these identities yields a expression 
that is somewhat unwieldy. 

As alternatives, we will derive some additional formulas that, depending on context, may 
have useful advantages. Our starting point will be the observation, used previously in the 
proof of Theorem |2.2[ that minors can be written in terms of full matrices and diagonal 
indicator matrices; specifically, for positive semidefinite L, 

det(Ly) = det(/yL + ly) (143) 
= (-1)1^1 det(/yL-/^) (144) 
= |det(/yL-/^)| , (145) 

where ly is the diagonal matrix with ones in the diagonal positions corresponding to elements 
of Y and zeros everywhere else, and Y — y — Y . These identities can be easily shown 
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by examining the matrices blockwise under the partition 3^ = y U y, as in the proof of 
Theorem [2^21 

Applying Equation (143) to Equation ( 142[ ), we get 



-PLiY) 



_ det(/yL + ly) 

~ det(L + /) 

= det((/yL + Iy)(L + /)-!) 

= det(/yL(L + + + . 



(146) 

(147) 
(148) 



Already, this expression, which is a single determinant of an x matrix, is in some ways 
easier to work with. We can also more easily write the likelihood in terms of K\ 



Vl{Y) = det(/yK + Iy{I - K)) . 



(149) 



Recall from Equation (27) that I — K is the marginal kernel of the complement DPP; thus, 
in an informal sense we can read Equation (149) as combining the marginal probability that 
Y is selected with the marginal probability that Y is not selected. 



We can also make a similar derivation using Equation (|144|): 
Vl{Y) = 



|y| det(/yL 



' det(L + /) 
-1)1^1 det((IyL - Iy){L + I)-i 
-1)1^1 det(IyL(L + /)-!- /. 
-1)1^1 det(IyK - Iy(I - K)) 

-1)1^1 det(i^-/y) 
|det(/s:-/y)| . 



y(L + I)- 



(150) 

(151) 
(152) 
(153) 
(154) 
(155) 



Note that Equation (149) involves asymmetric matrix products, but Equation ( 155| ) does 
not; on the other hand, K — ly \s (in general) indefinite. 



4 Learning 

We have seen that determinantal point process offer appealing modeling intuitions and 
practical algorithms, capturing geometric notions of diversity and permitting computationally 
efficient inference in a variety of settings. However, to accurately model real- world data 
we must first somehow determine appropriate values of the model parameters. While an 
expert could conceivably design an appropriate DPP kernel from prior knowledge, in general, 
especially when dealing with large data sets, we would like to have an automated method 
for learning a DPP. 

We first discuss how to parameterize DPPs conditioned on input data. We then define 
what we mean by learning, and, using the quality vs. diversity decomposition introduced in 



Section 3.1, we show how a parameterized quality model can be learned efficiently from a 



training set. 
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4.1 Conditional DPPs 



Suppose we want to use a DPP to model the seats in an auditorium chosen by students 
attending a class. (Perhaps we think students tend to spread out.) In this context each 
meeting of the class is a new sample from the empirical distribution over subsets of the 
(fixed) seats, so we merely need to collect enough samples and we should be able to fit our 
model, as desired. 

For many problems, however, the notion of a single fixed base set y is inadequate. For 
instance, consider extractive document summarization, where the goal is to choose a subset 
of the sentences in a news article that together form a good summary of the entire article. 
In this setting 3^ is the set of sentences in the news article being summarized, thus 3^ is not 
fixed in advance but instead depends on context. One way to deal with this problem is to 
model the summary for each article as its own DPP with a separate kernel matrix. This 
approach certainly affords great flexibility, but if we have only a single sample summary for 
each article, there is little hope of getting good parameter estimates. Even more importantly, 
we have learned nothing that can be applied to generate summaries of unseen articles at 
test time, which was presumably our goal in the first place. 

Alternatively, we could let y be the set of all sentences appearing in any news article; this 
allows us to learn a single model for all of our data, but comes with obvious computational 
issues and does not address the other concerns, since sentences are rarely repeated. 

To solve this problem, we need a DPP that depends parametrically on the input data; 
this will enable us to share information across training examples in a justifiable and effective 
way. We first introduce some notation. Let X be the input space; for example, X might be 
the space of news articles. Let y{X) denote the ground set of items implied by an input 
X G A', e.g., the set of all sentences in news article X. We have the following definition. 

Definition 4.1. A conditional DPP V{Y — Y\X) is a conditional probabilistic model 
which assigns a probability to every possible subset Y C y{X). The model takes the form of 
an L-ensemble: 

V{Y = Y\X) oc det(Ly(X)) , (156) 

where L{X) is a positive semidefinite |3^(X)| x |3^(X)| kernel matrix that depends on the 
input. 

As discussed in Section [2| the normalization constant for a conditional DPP can be 
computed efficiently and is given by det(L(X)+/). Using the quality /diversity decomposition 
introduced in Section |3.1[ we have 

L,j{X) = q,(X)ct>,{XycPj(X)qj(X) (157) 

for suitable qi{X) G M+ and ^^(X) G M^, ||0i(X)|| = 1, which now depend on X. 

In the following sections we will discuss application-specific parameterizations of the 
quality and diversity models q and in terms of the input. First, however, we review our 
learning setup. 



4.1.1 Supervised learning 

The basic supervised learning problem is as follows. We receive a training data sample 
{(X(^), y(^))}^^ drawn independently and identically from a distribution D over pairs 
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{X,Y) e A" X 2^(^), where A" is an input space and y{X) is the associated ground set for 
input X. We assume the conditional DPP kernel L(X;9) is parameterized in terms of a 
generic 0, and let 

det(Ly(X;g)) 

^'^^^^^ = demX;e) + I) ^^^^^ 

denote the conditional probability of an output Y given input X under parameter 6. The 
goal of learning is to choose appropriate 9 based on the training sample so that we can make 
accurate predictions on unseen inputs. 

While there are a variety of objective functions commonly used for learning, here we will 
focus on maximum likelihood learning (or maximum likelihood estimation, often abbreviated 
MLE), where the goal is to choose 6 to maximize the conditional log-likelihood of the 
observed data: 



C{9) = \ogl[Vo{Y^'^\X^'^) (159) 

t=i 

T 

= ^iogp^(yW|xW) (160) 

t=i 

T 

= J2 [logdet(L^(,)(xW;0)) - logdet(L(xW; 0) + /)] . (161) 

t=i 

Optimizing C is consistent under mild assumptions; that is, if the training data are actually 
drawn from a conditional DPP with parameter 0*, then the learned ^ 0* as T ^ oc. 
Of course real data are unlikely to exactly follow any particular model, but in any case 
the maximum likelihood approach has the advantage of calibrating the DPP to produce 
reasonable probability estimates, since maximizing C can be seen as minimizing the log-loss 
on the training data. 

To optimize the log-likelihood, we will use standard algorithms such as gradient ascent 
or L-BFGS [Nocedal , 1980| . These algorithms depend on the gradient V C{9)^ which must 



exist and be computable, and they converge to the optimum whenever C{6) is concave in 6. 
Thus, our ability to optimize likelihood efficiently will depend fundamentally on these two 
properties. 



4.2 Learning quality 



We begin by showing how to learn a parameterized quality model qi{X] 9) when the diversity 
feature function (j)i{X) is held fixed [Kulesza and Taskar , 2011b| . This setup is somewhat 
analogous to support vector machines |Vapnik, 2000 , where a kernel is fixed by the practi- 
tioner and then the per-example weights are automatically learned. Here, (j)i{X) can consist 
of any desired measurements (and could even be infinite-dimensional, as long as the resulting 
similarity matrix 5 is a proper kernel). We propose computing the quality scores using a 
log-linear model: 

1 



g,(X;0)=exp -0'/.(^) 



(162) 
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where fi{X) G is a feature vector for item i and the parameter 9 is now concretely an 
element of W^. Note that feature vectors fi{X) are in general distinct from the 
former are used for modeling quality, and will be "interpreted" by the parameters 6, while 
the latter define the diversity model 5, which is fixed in advance. We have 

p n^l.Yl n.er [e^P jO'' U^))] det{Sy{X)) 

' ' J:Y^cyix)UieY'[e^p{0^fi{X))]detiSY'iX))- ^ > 

For ease of notation, going forward we will assume that the training set contains only a 
single instance (X, y), and drop the instance index t. All of the following results extend 
easily to multiple training examples. First, we show that under this parameterization the 
log-likelihood function is concave in 0; then we will show that its gradient can be computed 
efficiently. With these results in hand we will be able to apply standard optimization 
techniques. 

Proposition 4.2. C{9) is concave in 9. 
Proof. We have 

C{e) = logVe{Y\X) (164) 

= ^T^_^^(X)+logdet(5y(X)) 

ieY 

-log exp(0^5^/,(X))det(5y,(^)). (165) 

With respect to 0, the first term is linear, the second is constant, and the third is the 
composition of a concave function (negative log-sum-exp) and a linear function, so the 
overall expression is concave. □ 



We now derive the gradient \/jC{9), using Equation ( |165| ) as a starting point. 



ieY 



log J2 expf^^ J^/,(X))det(5y, 



(166) 



^\-f(X)- V (^^ ^i^y fi^^)) det(5y.(X)) Y.^&' fjjX) 

t^^'^ ^ y,y^^^ ZY'^-Pi0^j:^eY'UX))det{SY^{X)) ^ ' 

= J2f^{X)- J2 re{Y'\X)J2f^iX)■ (168) 

ieY Y'cy(x) ieY' 

Thus, as in standard maximum entropy modeling, the gradient of the log-likelihood can be 
seen as the difference between the empirical feature counts and the expected feature counts 
under the model distribution. The difference here, of course, is that Vo is a DPP, which 
assigns higher probability to diverse sets. Compared with a standard independent model 



obtained by removing the diversity term from Vq^ Equation (168) actually emphasizes those 



training examples that are not diverse, since these are the examples on which the quality 
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Algorithm 4 Gradient of the log-likelihood 



Input: instance {X,Y), parameters 6 
Compute L(X;0) as in Equation (157) 
Eigendecompose L(X; 9) = J2n=i ^n'^n'^Z 
for i G y{X) do 

^ Z^n=l An+1 ni 

end for 

Output: gradient V C{9) 



model must focus its attention in order to overcome the bias imposed by the determinant. 
In the experiments that follow we will see that this distinction is important in practice. 



The sum over Y' in Equation (168) is exponential in |3^(X)|; hence we cannot compute 



it directly. Instead, we can rewrite it by switching the order of summation: 

P.(y'|X)^/,(X) = ^/,(X) P,(y'|X). (169) 

Note that ^Y'z^{i} ^oiX'\^) is the marginal probability of item i appearing in a set sampled 
from the conditional DPP. That is, the expected feature counts are computable directly from 
the marginal probabilities. Recall that we can efficiently marginalize DPPs; in particular, 
per- item marginal probabilities are given by the diagonal of K{X] 0), the marginal kernel 
(which now depends on the input and the parameters). We can compute K{X;9) from the 
kernel L{X;9) using matrix inversion or eigendecomposition. Algorithm [l] shows how we 
can use these ideas to compute the gradient of jC{9) efficiently. 

In fact, note that we do not need all of K(X; 0), but only its diagonal. In Algorithm [4] 
we exploit this in the main loop, using only N'^ multiplications rather than the we would 
need to construct the entire marginal kernel. Unfortunately, the savings is asymptotically 
irrelevant since we still need to eigendecompose L{X;9). It is conceivable that a faster 
algorithm exists for computing the diagonal of K{X] 9) directly, along the lines of ideas 
recently proposed by Tang and Saad |2011| (which focus on sparse matrices); however, we 



are not currently aware of a useful improvement over Algorithm [4j 
4.2.1 Experiments: document summarization 

We demonstrate learning for the conditional DPP quality model on an extractive multi- 
document summarization task using news text. The basic goal is to generate a short piece of 
text that summarizes the most important information from a news story. In the extractive 
setting, the summary is constructed by stringing together sentences found in a cluster of 
relevant news articles. This selection problem is a balancing act: on the one hand, each 
selected sentence should be relevant, sharing significant information with the cluster as a 
whole; on the other, the selected sentences should be diverse as a group so that the summary 
is not repetitive and is as informative as possible given its length [Dang , 2005, Nenkova 



eraL[|2006] . DPPs are a natural fit for this task, viewed through the decomposition of 



Section 3.1 [Kulesza and Taskar , 2011bj . 
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NASA and the Russian Space Agency have agreed to set 
aside a last-minute Russian request to launch an 
international space station into an orbit closer to Mir, 
officials announced Friday. . . . 



A last-minute alarm forced NASA to halt Thursday's 
launching of the space shuttle Endeavour, on a mission to 
start assembling the international space station. This was 
the first time in three years . . . 



The planet's most daring construction job began Friday as 
the shuttle Endeavour carried into orbit six astronauts and 
the first U.S. -built part of an international space station 
that is expected to cost more than $100 billion. . . . 



Following a series of intricate maneuvers and the skillful 
use of the space shuttle Endeavour's robot arm, 
astronauts on Sunday joined the first two of many 
segments that will form the space station . . . 



document cluster 



On Friday the shuttle Endeavor carried six astronauts into orbit to start 
building an international space station. The launch occurred after Russia 
and U.S. officials agreed not to delay the flight in order to orbit closer to 
MIR, and after a last-minute alarm forced a postponement. On Sunday 
astronauts joining the Russian-made Zarya control module cylinder with 
the American-made module to form a 70,000 pounds mass 77 feet 
long. . . . 



human summary 



• NASA and the Russian Space Agency have agreed to set aside . . . 

• A last-minute alarm forced NASA to halt Thursday's launching . . . 

• This was the first time in three years, and 19 flights . . . 

• After a last-minute alarm, the launch went off flawlessly Friday . . 

• Following a series of intricate maneuvers and the skillful . . . 

• It looked to be a perfect and, hopefully, long-lasting fit. . . . 



extractive summary 



Figure 12: A sample cluster from the DUG 2004 test set, with one of the four human 
reference summaries and an (artificial) extractive summary. 



As in Section 4.1 , the input X will be a cluster of documents, and y{X) a set of candidate 
sentences from those documents. In our experiments y{X) contains all sentences from all 
articles in the cluster, although in general preprocessing could also be used to try to improve 
the candidate set [Conroy et al. , 2QQ4| . We will learn a DPP to model good summaries Y for 
a given input X. Because DPPs model unordered sets while summaries are linear text, we 
construct a written summary from Y by placing the sentences it contains in the same order 
in which they appeared in the original documents. This policy is unlikely to give optimal 



results, but it is consistent with prior work |Lin and Bilmes, 2010 and seems to perform well 



Furthermore, it is at least partially justified by the fact that modern automatic summary 
evaluation metrics like ROUGE, which we describe later, are mostly invariant to sentence 
order. 

We experiment with data from the multi-document summarization task (Task 2) of 



the 2003 and 2004 Document Understanding Gonference (DUG) pang! |2005] . The article 
clusters used for these tasks are taken from the NIST TDT collection. Each cluster contains 
approximately 10 articles drawn from the AP and New York Times newswires, and covers a 
single topic over a short time span. The clusters have a mean length of approximately 250 
sentences and 5800 words. The 2003 task, which we use for training, contains 30 clusters, 
and the 2004 task, which is our test set, contains 50 clusters. Each cluster comes with four 
reference human summaries (which are not necessarily formed by sentences from the original 
articles) for evaluation purposes. Summaries are required to be at most 665 characters in 
length, including spaces. Figure [12] depicts a sample cluster from the test set. 

To measure performance on this task we follow the original evaluation and use ROUGE, 
an automatic evaluation metric for summarization [Lin , 2004| . ROUGE measures n-gram 
overlap statistics between the human references and the summary being scored, and combines 
them to produce various sub-metrics. ROUGE- 1, for example, is a simple unigram recall 



measure that has been shown to correlate quite well with human judgments [Lin, 20041. 
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Algorithm 5 Constructing extractive training data 

Input: article cluster X, human reference word counts H, character limit h 

u ^ y{x) 
y ^ 

while ?7 7^ do 

• , ( ROUGE-lF(words(i'),i?)\ 

Y ^YU{i} 

H ^ max(i7 — words(z), 0) 
U - {{i} U {z'|length(y) + length(zO > h}) 
end while 

Output: extractive oracle summary Y 



Here, we use ROUGE's unigram F-measure (which combines ROUGE- 1 with a measure of 
precision) as our primary metric for development. We refer to this measure as ROUGE- IF. 
We also report ROUGE- IP and ROUGE- IR (precision and recall, respectively) as well as 
R0UGE-2F and R0UGE-SU4F, which include bigram match statistics and have also been 
shown to correlate well with human judgments. Our implementation uses ROUGE version 
1.5.5 with stemming turned on, but without stopword removal. These settings correspond 
to those used for the actual DUG competitions [Dang , 2005| ; however, we use a more recent 
version of ROUGE. 

Training data Recall that our learning setup requires a training sample of pairs (X, y), 
where Y C y{X). Unfortunately, while the human reference summaries provided with the 
DUG data are high-quality, they are not extractive, thus they do not serve as examples of 
summaries that we can actually model. To obtain high-quality extractive "oracle" summaries 
from the human summaries, we employ a simple greedy algorithm (Algorithm [5]) . On each 
round the sentence that achieves maximal unigram F-measure to the human references, 
normalized by length, is selected and added to the extractive summary. Since high F-measure 
requires high precision as well as recall, we then update the references by removing the 
words "covered" by the newly selected sentence, and proceed to the next round. 

We can measure the success of this approach by calculating ROUGE scores of our oracle 
summaries with respect to the human summaries. Table [2] shows the results for the DUG 
2003 training set. For reference, the table also includes the ROUGE scores of the best 
automatic system from the DUG competition in 2003 ("machine"), as well as the human 
references themselves ("human"). Note that, in the latter case, the human summary being 
evaluated is also one of the four references used to compute ROUGE; hence the scores are 
probably significantly higher than a human could achieve in practice. Furthermore, it has 
been shown that extractive summaries, even when generated optimally, are by nature limited 



in quality compared to unconstrained summaries [Genest et al. , 2010 . Thus we believe that 



the oracle summaries make strong targets for training. 

Features We next describe the feature functions that we use for this task. For diversity 
features 0i(X), we generate standard normalized tf-idf vectors. We tokenize the input test. 
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System rouge-if rouge-2F rouge-su4F 



Machine 35.17 9.15 12.47 

Oracle 46.59 16.18 19.52 

Human 56.22 33.37 36.50 



Table 2: ROUGE scores for the best automatic system from DUG 2003, our heuristically- 
generated oracle extractive summaries, and human summaries. 

remove stop words and punctuation, and apply a Porter stemmerj^ Then, for each word 
the term frequency tii{w) of w in sentence i is defined as the number of times the word 
appears in the sentence, and the inverse document frequency idf{w) is the negative logarithm 
of the fraction of articles in the training set where w appears. A large value of idf(w) implies 
that w is relatively rare. Finally, the vector has one element per word, and the value 

of the entry associated with word w is proportional to tfi{w)idf{w). The scale of (f)i{X) is 
set such that ||(/)i(X)|| = 1. 

Under this definition of 0, the similarity Sij between sentences i and j is known as their 
cosine similarity: 



Two sentences are cosine similar if they contain many of the same words, particularly words 
that are uncommon (and thus more likely to be salient). 

We augment (f)i{X) with an additional constant feature taking the value p > 0, which is 
a hyperparameter. This has the effect of making all sentences more similar to one another, 
increasing repulsion. We set p to optimize ROUGE- IF score on the training set; in our 
experiments, the best choice was p = 0.7. 

We use the very standard cosine distance as our similarity metric because we need 
to be confident that it is sensible; it will remain fixed throughout the experiments. On 
the other hand, weights for the quality features are learned, so we can use a variety of 
intuitive measures and rely on training to find an appropriate combination. The quality 
features we use are listed below. For some of the features, we make use of cosine distances; 
these are computed using the same tf-idf vectors as the diversity features. When a feature 
is intrinsically real-valued, we produce a series of binary features by binning. The bin 
boundaries are determined either globally or locally. Global bins are evenly spaced quantiles 
of the feature values across all sentences in the training set, while local bins are quantiles of 
the feature values in the current cluster only. 

• Constant: A constant feature allows the model to bias towards summaries with a 
greater or smaller number of sentences. 

• Length: We bin the length of the sentence (in characters) into five global bins. 

• Document position: We compute the position of the sentence in its original document 
and generate binary features indicating positions 1-5, plus a sixth binary feature 

^Code for this preprocessing pipeline was provided by Hui Lin and Jeff Bilmes. 




E^tf,(«;)tf,(^)idf^(^) 



e[0, 1]. 



(170) 
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indicating all other positions. We expect that, for newswire text, sentences that appear 
earlier in an article are more likely to be useful for summarization. 



• Mean cluster similarity: For each sentence we compute the average cosine distance 
to all other sentences in the cluster. This feature attempts to measure how well the 
sentence reflects the salient words occurring most frequently in the cluster. We use 
the raw score, five global bins, and ten local bins. 

• LexRank: We compute continuous LexRank scores by finding the principal eigenvector 
of the row-normalized cosine similarity matrix. (See Erkan and Radev [2004 for details.) 



This provides an alternative measure of centrality. We use the raw score, five global 
bins, and five local bins. 

• Personal pronouns: We count the number of personal pronouns ("he", "her", 
"themselves", etc.) appearing in each sentence. Sentences with many pronouns may be 
poor for summarization since they omit important entity names. 

In total we have 40 quality features; including p our model has 41 parameters. 

Inference At test time, we need to take the learned parameters 9 and use them to predict 
a summary Y for a previously unseen document cluster X. One option is to sample from 
the conditional distribution, which can be done exactly and efficiently, as described in 
Section 2.4.4[ However, sampling occasionally produces low-probability predictions. We 



obtain better performance on this task by applying two alternative inference techniques. 

Greedy MAP approximation. One common approach to prediction in probabilistic models 
is maximum a posteriori (MAP) decoding, which selects the highest probability configuration. 
For practical reasons, and because the primary metrics for evaluation were recall-based, 
the DUG evaluations imposed a length limit of 665 characters, including spaces, on all 
summaries. In order to compare with prior work we also apply this limit in our tests. Thus, 
our goal is to find the most likely summary, subject to a budget constraint: 

y^^P = argmax Ve{Y\X) 

Y 

s.t. ^ length(z) < 6 , (171) 

i^Y 

where length(z) is the number of characters in sentence z, and h — 665 is the limit on the total 



length. As discussed in Section 2.4.5, computing y^^P exactly is NP-hard, but, recalling 



that the optimization in Equation (ITlj) is submodular, we can approximate it through a 
simple greedy algorithm (Algorithm [6^7 



Algorithm [gl is closely related to those given by'Krause and Guestrin [2005] and especially 
les] 



Lin and Bilmes) [2010 1. As discussed in Section [2.4.5 , algorithms of this type have formal 



approximation guarantees for monotone submodular problems. Our MAP problem is not 
generally monotone; nonetheless. Algorithm |6] seems to work well in practice, and is very 
fast (see Table [s]). 

Minimum Bayes risk decoding. The second inference technique we consider is minimum 
Bayes risk (MBR) decoding. First proposed by Goel and Byrne [2000 for automatic speech 
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Algorithm 6 Approximately computing the MAP summary 
Input: document cluster A, parameter 0, character limit b 

u ^ y{x) 
y ^0 

while U ^0 do 

% ^ argmax^/^^ length(i) ) 

Y ^Y\^{^) 

U ^U- {{i} U {z'|length(y) + length(zO > b}) 
end while 

Output: summary Y 



recognition, MBR decoding has also been used successfully for word alignment and machine 
translation [Kumar and Byrne , 2002, 2004]. The idea is to choose a prediction that minimizes 



a particular application-specific loss function under uncertainty about the evaluation target. 
In our setting we use ROUGE-IF as a (negative) loss function, so we have 

yMBR _ g^j.ginax E [ROUGE-lF(y, Y*)] , (172) 
Y 

where the expectation is over realizations of 1^*, the true summary against which we are 
evaluated. Of course, the distribution of 1^* is unknown, but we can assume that our 
trained model Vo{-\X) gives a reasonable approximation. Since there are exponentially many 
possible summaries, we cannot expect to perform an exact search for Y^^^; however, we 
can approximate it through sampling, which is efficient. 

Combining these approximations, we have the following inference rule: 

1 ^ 

y™ _ argmax - ^ ROUGE-iF(y^\ y^) , (173) 

where y^,y^,...,y^ are samples drawn from Vo{-\X). In order to satisfy the length 
constraint imposed by the evaluation, we consider only samples with length between 660 
and 680 characters (rejecting those that fall outside this range), and crop Y^^^ to the limit 
of 665 bytes if necessary. The choice of i? is a tradeoff between fast running time and quality 
of inference. In the following section, we report results for R = 100, 1000, and 5000; Table |3] 
shows the average time required to produce a summary under each setting. Note that MBR 
decoding is easily parallelizable, but the results in Table [3] are for a single processor. Since 
MBR decoding is randomized, we report all results averaged over 100 trials. 

Results We train our model with a standard L-BFGS optimization algorithm. We place a 
zero-mean Gaussian prior on the parameters 0, with variance set to optimize ROUGE- IF on 
a development subset of the 2003 data. We learn parameters 9 on the DUG 2003 corpus, and 
test them using DUG 2004 data. We generate predictions from the trained DPP using the 
two inference algorithms described in the previous section, and compare their performance 
to a variety of baseline systems. 
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System 


Time (s) 


DPP-GREEDY 


0.15 


DPP-MBRlOO 


1.30 


DPP-MBRlOOO 


16.91 


DPP-MBR5000 


196.86 



Table 3: The average time required to produce a summary for a single cluster from the DUG 
2004 test set (without parallelization) . 



Our first and simplest baseline merely returns the first 665 bytes of the cluster text. 
Since the clusters consist of news articles, this is not an entirely unreasonable summary in 
many cases. We refer to this baseline as BEGIN. 

We also compare against an alternative DPP-based model with identical similarity 
measure and quality features, but where the quality model has been trained using standard 
logistic regression. To learn this baseline, each sentence is treated as a unique instance to be 
classified as included or not included, with labels derived from our training oracle. Thus, 
it has the advantages of a DPP at test time, but does not take into account the diversity 
model while training; comparing to this baseline allows us to isolate the contribution of 
learning the model parameters in context. Note that MBR inference is impractical for this 
model because its training does not properly calibrate for overall summary length, so nearly 
all samples are either too long or too short. Thus, we report only the results obtained from 
greedy inference. We refer to this model as LR+DPP. 

Next, we employ as baselines a range of previously proposed methods for multi-document 
summarization. Perhaps the simplest and most popular is Maximum Marginal Relevance 
(MMR), which uses a greedy selection process [Carbonell and Goldstein , 1998 . MMR 



relies on a similarity measure between sentences, for which we use the cosine distance 
measure 5, and a measure of relevance for each sentence, for which we use the same logistic 
regression-trained quality model as above. Sentences are chosen iteratively according to 



arg max 



aqi{X) — (1 — a) maxS^j 
jeY 



(174) 



where Y is the set of sentences already selected (initially empty), qi{X) is the learned quality 
score, and Sij is the cosine similarity between sentences i and j. The tradeoff a is optimized 
on a development set, and sentences are added until the budget is full. We refer to this 
baseline as LR+MMR. 

We also compare against the three highest-scoring systems that actually competed in 
the DUG 2004 competition — peers 65, 104, and 35 — as well as the submodular graph-based 
approach recently described by Lin and Bilmes [2010 , which we refer to as SUBMODl, and 



the improved submodular learning approach proposed by Lin and Bilmes 2012| , which we 
denote SUBMOD2. We produced our own implementation of SUBMODl, but rely on previously 
reported numbers for SUBMOD2, which include only ROUGE-1 scores. 

Table |4] shows the results for all methods on the DUG 2004 test corpus. Scores for the 
actual DUG competitors differ slightly from the originally reported results because we use 
an updated version of the ROUGE package. Bold entries highlight the best performance 
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bystem 


ROUGE-IF 


ROUGE-IP 


ROUGE-IR 


ROUGE-2F 


ROUGE-SU4F 


BEGIN 


32.08 


31.53 


32.69 


6.52 


10.37 


LR+MMR 


37.58 


37.15 


38.05 


9.05 


13.06 


LR+DPP 


37.96 


37.67 


38.31 


8.88 


13.13 


PEER 35 


37.54 


37.69 


37.45 


8.37 


12.90 


PEER 104 


37.12 


36.79 


37.48 


8.49 


12.81 


PEER 65 


37.87 


37.58 


38.20 


9.13 


13.19 


SUBMODl 


38.73 


38.40 


39.11 


8.86 


13.11 


SUBMOD2 


39.78 


39.16 


40.43 






DPP-GREEDY 


38.96 


38.82 


39.15 


9.86 


13.83 


DPP-MBRlOO 


38.83 


38.06 


39.67 


8.85 


13.38 


DPP-MBRlOOO 


39.79 


38.96 


40.69 


9.29 


13.87 


DPP-MBR5000 


40.33 


39.43 


41.31 


9.54 


14.13 



Table 4: ROUGE scores on the DUG 2004 test set. 



Features 


ROUGE-IF 


ROUGE-IP 


ROUGE-IR 


All 


38.96 


38.82 


39.15 


All but length 


37.38 


37.08 


37.72 


All but position 


36.34 


35.99 


36.72 


All but similarity 


38.14 


37.97 


38.35 


All but LexRank 


38.10 


37.92 


38.34 


All but pronouns 


38.80 


38.67 


38.98 


All but similarity, LexRank 


36.06 


35.84 


36.32 



Table 5: ROUGE scores for dpp-GREEDY with features removed. 



in each column; in the case of MBR inference, which is stochastic, the improvements are 
significant at 99% confidence. The DPP models outperform the baselines in most cases; 
furthermore, there is a significant boost in performance due to the use of DPP maximum 
likelihood training in place of logistic regression. MBR inference performs best, assuming we 
take sufficiently many samples; on the other hand, greedy inference runs more quickly than 
DPP-MBRlOO and produces superior results. Relative to most other methods, the DPP model 
with MBR inference seems to more strongly emphasize recall. Note that MBR inference was 
performed with respect to ROUGE-IF, but could also be run to optimize other metrics if 
desired. 

Feature contributions. In Table [5] we report the performance of DPP-GREEDY when 
different groups of features from Section 4.2.1 are removed, in order to estimate their 
relative contributions. Length and position appear to be quite important; however, although 
individually similarity and LexRank scores have only a modest impact on performance, 
when both are omitted the drop is significant. This suggests, intuitively, that these two 
groups convey similar information — both are essentially measures of centrality — but that 
this information is important to achieving strong performance. 
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5 A;-DPPs 



A determinantal point process assigns a probability to every subset of the ground set 3^. 
This means that, with some probabihty, a sample from the process will be empty; with 
some probability, it will be all of 3^. In many cases this is not desirable. For instance, we 
might want to use a DPP to model the positions of basketball players on a court, under the 
assumption that a team tends to spread out for better coverage. In this setting, we know 
that with very high probability each team will have exactly five players on the court. Thus, 
if our model gives some probability of seeing zero or fifty players, it is not likely to be a 
good fit. 

We showed in Section 2. 4. 4| that there exist elementary DPPs having fixed cardinality k] 
however, this is achieved only by focusing exclusively (and equally) on k specific "aspects" 
of the data, as represented by eigenvectors of the kernel. Thus, for DPPs, the notions of size 
and content are fundamentally intertwined. We cannot change one without affecting the 
other. This is a serious limitation on the types of distributions than can be expressed; for 
instance, a DPP cannot even capture the uniform distribution over sets of cardinality k. 

More generally, even for applications where the number of items is unknown, the size 
model imposed by a DPP may not be a good fit. We have seen that the cardinality of a 
DPP sample has a simple distribution: it is the number of successes in a series of Bernoulli 
trials. But while this distribution characterizes certain types of data, other cases might look 
very different. For example, picnickers may tend to stake out diverse positions in a park, but 
on warm weekend days there might be hundreds of people, and on a rainy Tuesday night 
there are likely to be none. This bimodal distribution is quite unlike the sum of Bernoulli 
variables imposed by DPPs. 

Perhaps most importantly, in some cases we do not even want to model cardinality at 
all, but instead offer it as a parameter. For example, a search engine might need to deliver 
ten diverse results to its desktop users, but only five to its mobile users. This ability to 
control the size of a DPP "on the fiy" can be crucial in real-world applications. 

In this section we introduce fc-DPPs, which address the issues described above by 
conditioning a DPP on the cardinality of the random set Y . This simple change effectively 
divorces the DPP content model, with its intuitive diversifying properties, from the DPP 
size model, which is not always appropriate. We can then use the DPP content model with 
a size model of our choosing, or simply set the desired size based on context. The result is 
a significantly more expressive modeling approach (which can even have limited positive 
correlations) and increased control. 

We begin by defining fc-DPPs. The conditionalization they require, though simple in 
theory, necessitates new algorithms for inference problems like normalization and sampling. 
Naively, these tasks require exponential time, but we show that through recursions for 
computing elementary symmetric polynomials we can solve them exactly in polynomial time. 
Finally, we demonstrate the use of fc-DPPs on an image search problem, where the goal is 
to show users diverse sets of images that correspond to their query. 
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5.1 Definition 



A fc-DPP on a discrete set 3^ = {1, 2, ... , A^} is a distribution over all subsets Y cy with 
cardinality k [Kulesza and Taskar[ 2011a| . In contrast to the standard DPP, which models 
both the size and content of a random subset a fc-DPP is concerned only with the content 
of a random fc-set. Thus, a fc-DPP is obtained by conditioning a standard DPP on the event 
that the set Y has cardinality k. Formally, the fc-DPP 7^^ gives probabilities 

l^\Y^\=k det(Ly/j 

where \Y\ — k and L is a positive semidefinite kernel. Compared to the standard DPP, the 
only changes are the restriction on Y and the normalization constant. While in a DPP every 
fc-set Y competes with all other subsets of in a fc-DPP it competes only with sets of the 
same cardinality. This subtle change has significant implications. 

For instance, consider the seemingly simple distribution that is uniform over all sets 
Y C y with cardinality k. If we attempt to build a DPP capturing this distribution we 
quickly run into difficulties. In particular, the marginal probability of any single item is 
so the marginal kernel if it exists, must have on the diagonal. Likewise, the marginal 

probability of any pair of items is , and so by symmetry the off diagonal entries of K 

must be equal to a constant. As a result, any valid marginal kernel has to be the sum of a 
constant matrix and a multiple of the identity matrix. Since a constant matrix has at most 
one nonzero eigenvalue and the identity matrix is full rank, it is easy to show that, except 
in the special cases = 0, 1, A^ — 1, the resulting kernel has full rank. But we know that a 
full rank kernel implies that the probability of seeing all A^ items together is nonzero. Thus 
the desired process cannot be a DPP unless A: = 0, 1, A^ — 1, or A^. On the other hand, a 
fc-DPP with the identity matrix as its kernel gives the distribution we are looking for. This 
improved expressiveness can be quite valuable in practice. 



5.1.1 Alternative models of size 

Since a fc-DPP is conditioned on cardinality, k must come from somewhere outside of the 
model. In many cases, k may be fixed according to application needs, or perhaps changed 
on the fly by users or depending on context. This flexibility and control is one of the major 
practical advantages of fc-DPPs. Alternatively, in situations where we wish to model size as 
well as content, a fc-DPP can be combined with a size model Vsize that assigns a probability 
to every possible /c G {1, 2, . . . , A^}: 

V{Y) = VsiU\Y\)'PL^{Y)' (176) 

Since the fc-DPP is a proper conditional model, the distribution V is well-defined. By 
choosing Vsize appropriate to the task at hand, we can effectively take advantage of the 
diversifying properties of DPPs in situations where the DPP size model is a poor fit. 

As a side effect, this approach actually enables us to use fc-DPPs to build models with 
both negative and positive correlations. For instance, if Vsize indicates that there are likely 
to be either hundreds of picnickers in the park (on a nice day) or, otherwise, just a few, then 
knowing that there are fifty picnickers today implies that there are likely to be even more. 
Thus, /c-DPPs can yield more expressive models than DPPs in this sense as well. 



57 



5.2 Inference 



Of course, increasing the expressiveness of the DPP causes us to wonder whether, in doing so, 
we might have lost some of the convenient computational properties that made DPPs useful 
in the first place. Naively, this seems to be the case; for instance, while the normalizing 



constant for a DPP can be written in closed form, the sum in Equation (175) is exponential 
and seems hard to simplify. In this section, we will show how fc-DPP inference can in fact be 
performed efficiently, using recursions for computing the elementary symmetric polynomials. 

5.2.1 Normalization 

Recall that the kth elementary symmetric polynomial on Ai, A2 . . . , A^r is given by 

efc(Ai,A2,...,AAr)= J] A, . (177) 

For instance. 



JC{1,2,...,N} neJ 

IJI=fe 



ei(Ai,A2,A3) = A1 + A2 + A3 (178) 
e2(Ai, A2, A3) = A1A2 + A1A3 + A2A3 (179) 
e3(Ai,A2,A3) = A1A2A3. (180) 

Proposition 5.1. The normalization constant for a k-DPP is 

Zk^ ^ det(Ly/) = efc(Ai, A2, . . . , Aat) , (181) 

\Y'\=k 

where Ai, A2, . . . , Aat are the eigenvalues of L. 

Proof. One way to see this is to examine the characteristic polynomial of L, det(L — A/) 
[Gerfand , 1989| . We can also show it directly using properties of DPPs. Recalhng that 



^ det(Ly) = det(L + /) , (182) 

we have 

det(Ly/) = det(L + /) Vl{Y') , (183) 

\Y'\=k \Y'\=k 



where Vl is the DPP with kernel L. Applying Lemma 2.6, which expresses any DPP as a 
mixture of elementary DPPs, we have 

det(L + /) -PLiY')^ J2 E 'P'^'iY')]!^- (184) 

\Y'\=k \Y'\=kJC{l,2,...,N} neJ 

= E E (185) 

\J\=k\Y'\=k neJ 

= En^- (186) 

\J\=kneJ 



where we use Lemma 2.7 in the last two steps to conclude that V^-^iY') = unless \ J\ — \Y'\. 



(Recall that Vj is the set of eigenvectors of L associated with A^^ for n E J.) □ 
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Algorithm 7 Computing the elementary symmetric polynomials 



Input: k, eigenvalues Ai, A2, . . . A^^ 
eJJ^l Vn G {0,l,2,...,Ar} 
eO^O VI e {l,2,...,fc} 
for I = 1,2, ...fc do 
forn = 1,2, . . . ,iV do 

end for 
end for 

Output: efc(Ai, A2, . . . , A^v) = 



To compute the kth elementary symmetric polynomial, we can use the recursive algorithm 
given in Algorithm [7[ which is based on the observation that every set of k eigenvalues 
either omits Aat, in which case we must choose k of the remaining eigenvalues, or includes 
A AT, in which case we get a factor of A at and choose only /c — 1 of the remaining eigenvalues. 
Formally, letting be a shorthand for e/e(Ai, A2, . . . , Aat), we have 

= et' + Aivef.l^ • (187) 

Note that a variety of recursions for computing elementary symmetric polynomials exist, 
including Newton's identities, the Difference Algorithm, and the Summation Algorithm 
[Baker and Harwell , 1996]. Algorithm [t] is essentially the Summation Algorithm, which is 



both asymptotically faster and numerically more stable than the other two, since it uses 
only sums and does not rely on precise cancellation of large numbers. 

Algorithm (tI runs in time 0{Nk). Strictly speaking, the inner loop need only iterate up 



to — + / in order to obtain at the end; however, by going up to N we compute all 



of the preceding elementary symmetric polynomials along the way. Thus, by running 
Algorithm [7| with k — N we can compute the normalizers for fc-DPPs of every size in time 
0{N'^). This can be useful when k is not known in advance. 

5.2.2 Sampling 

Since a fc-DPP is just a DPP conditioned on size, we could sample a fc-DPP by repeatedly 
sampling the corresponding DPP and rejecting the samples until we obtain one of size k. To 



make this more efficient, recall from Section 2.4.4 that the standard DPP sampling algorithm 



proceeds in two phases. First, a subset V of the eigenvectors of L is selected at random, 
and then a set of cardinality |V^| is sampled based on those eigenvectors. Since the size of a 
sample is fixed in the first phase, we could reject the samples before the second phase even 
begins, waiting until we have \V\ = k. However, rejection sampling is likely to be slow. It 
would be better to directly sample a set V conditioned on the fact that its cardinality is 
k. In this section we show how sampling k eigenvectors can be done efficiently, yielding a 
sampling algorithm for fc-DPPs that is asymptotically as fast as sampling standard DPPs. 

We can formalize the intuition above by rewriting the /c-DPP distribution in terms of 
the corresponding DPP: 

Vt{y)-^det{L + I)VL{Y) (188) 
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Algorithm 8 Sampling k eigenvectors 



Input: A:, eigenvalues Ai, A2, . . . , Aat 

compute for / = 0, 1, . . . , /c and n = 0, 1, . . . , (Algorithm [t]) 
l^k 

forn = A^, . . . , 2, 1 do 
if / = then 

break 
end if 

if ur^U[0,l] <Xn-^ then 
J ^ JU{n} 
l^l-l 
end if 
end for 
Output: J 



whenever |y| = fc, where we replace the DPP normalization constant with the fc-DPP 



normalization constant using Proposition |5.1[ Applying Lemma |2.6| and Lemma |2.7| to 
decompose the DPP into elementary parts yields 

\j\=k neJ 

Therefore, a fc-DPP is also a mixture of elementary DPPs, but it only gives nonzero weight 
to those of dimension k. Since the second phase of DPP sampling provides a means for 
sampling from any given elementary DPP, we can sample from a fc-DPP if we can sample 
index sets J according to the corresponding mixture components. Like normalization, this 
is naively an exponential task, but we can do it efficiently using the recursive properties of 
elementary symmetric polynomials. 

Theorem 5.2. Let J be the desired random variable^ so that Pr(J = J) = '^YlneJ^'^ 
when \J\ — k, and zero otherwise. Then Algorithm^ yields a sample for J. 

Proof. If = 0, then Algorithm |8] returns immediately at the first iteration of the loop with 
J = 0, which is the only possible value of J. 

If = 1 and A: = 1, then J must contain the single index 1. We have e\ = Ai and 

= 1, thus Ai^ = 1, and Algorithm jsj returns J = {1} with probability 1. 
We proceed by induction and compute the probability that Algorithm [8] returns J for 
> 1 and 1 < A: < A^. By inductive hypothesis, if an iteration of the loop in Algorithm [8] 

begins with n < N and < / < n, then the remainder of the algorithm adds to J a set of 

elements J' with probability 

^ n V (190) 

if I J'l = /, and zero otherwise. 
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Now suppose that J contains N ^ J — J' [J {N}- Then N must be added to J in the 
first iteration of the loop, which occurs with probability ^ ^ . The second iteration then 

begins with n — N — 1 and I — k — 1. If / is zero, we have the immediate base case; otherwise 
we have 1 < / < n. By the inductive hypothesis, the remainder of the algorithm selects J' 
with probability 

^ n (191) 



if I J' I = k — 1, and zero otherwise. Thus Algorithm [s] returns J with probabihty 

^n^-=;^n^- (192) 




if I J| = fc, and zero otherwise. 

On the other hand, if J does not contain then the first iteration must add nothing to 
J; this happens with probability 



giV-l N-l 

Aiv^ = Ar. (193) 



where we use the fact that — XNe^_i ^ ^' "^^^ second iteration then begins with 



n = N — 1 and / = k. We observe that if — 1 < A:, then Equation (193) is equal to zero, 
since ef = whenever / > n. Thus almost surely the second iteration begins with k < and 
we can apply the inductive hypothesis. This guarantees that the remainder of the algorithm 
chooses J with probability 



^ n (194) 

whenever \J\ = k. The overall probability that Algorithm returns J is therefore 




nGJ ^k fi^j 

if I J| = k, and zero otherwise. □ 



Algorithm [s] precomputes the values of ej, . . . , e^, which requires 0{Nk) time using 
Algorithm [7| The loop then iterates at most N times and requires only a constant number 



of operations, so Algorithm [8] runs in 0{Nk) time overall. By Equation (189), selecting J 
with Algorithm [s] and then sampling from the elementary DPP V^-^ generates a sample from 



the fc-DPP. As shown in Section [2.4.4 , sampling an elementary DPP can be done in 0{Nk^) 
time (see the second loop of Algorithm [l]) , so sampling fc-DPPs is 0{Nk^) overall, assuming 
we have an eigendecomposition of the kernel in advance. This is no more expensive than 
sampling a standard DPP. 
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5.2.3 Mar ginalizat ion 

Since fc-DPPs are not DPPs, they do not in general have marginal kernels. However, we can 
still use their connection to DPPs to compute the marginal probability of a set A, \A\ < k: 

Vl{A C = ^ Vl{Y' U A) (196) 



\Y'\ = k-\A\ 

det(L + /) 



Vl{Y' U A) (197) 



\Y'\=k-\A\ 

Any'=0 



"^^^^^^^^ Vl{Y ^Y'yjA\A(ZY)VL{A(ZY) (198) 

I 

Vl{A(ZY), (199) 



Zt\A\ det(L + /) 



\Y'\ = k-\A\ 



Zk det(L^ + /)^ 



where L"^ is the kernel, given in Equation (42), of the DPP conditioned on the inclusion of 
A, and 

= det(L^ + /) J2 ^l{Y ^Y'yjA\A<ZY) (200) 

\Y'\ = k-\A\ 

Ar\Y'^^ 

= det(L^,) (201) 

\Y'\^k-\A\ 

AnY'=% 

is the normalization constant for the {k — I^D-DPP with kernel L^. That is, the marginal 
probabilities for a fc-DPP are just the marginal probabilities for a DPP with the same kernel. 



but with an appropriate change of normalizing constants. We can simplify Equation (199) 
by observing that 

det(L^) ^ Vl{A C Y) 
det(L + /) det(L^ + /) ' ^ ^ 

since the left hand side is the probability (under the DPP with kernel L) that A occurs by 
itself, and the right hand side is the marginal probability of A multiplied by the probability 
of observing nothing else conditioned on observing A: 1/ det(L^ + /). Thus we have 

Vl{A C r) = det(L^) = Zt\A\n{A) . (203) 

That is, the marginal probability of A is the probability of observing exactly A times the 
normalization constant when conditioning on A. Note that a version of this formula also 
holds for standard DPPs, but there it can be rewritten in terms of the marginal kernel. 



Singleton marginals Equations (199) and (203) are general but require computing large 
determinants and elementary symmetric polynomials, regardless of the size of A. Moreover, 
those quantities (for example, det(L^ + ^)) must be recomputed for each unique A whose 
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marginal probability is desired. Thus, finding the marginal probabilities of many small sets 
is expensive compared to a standard DPP, where we need only small minors of K. However, 
we can derive a more efficient approach in the special but useful case where we want to 
know all of the singleton marginals for a fc-DPP — for instance, in order to implement quality 
learning as described in Section [4. 2[ 

We start by using Equation ( |189[ ) to write the marginal probability of an item i in terms 
of a combination of elementary DPPs: 

^ ^) = ^ E ^""'(^ ^ ^) n ^n' . (204) 

\J\=k n'eJ 

Because the marginal kernel of the elementary DPP V^-^ is given by ^^^jVnV^, we have 
-Phi e 1^) = 4 E ( E(^ne.)' ) n V (205) 

\neJ ) n'eJ 

1 ^ 

= ^E(^ne.)' E n^-' (206) 

n=l J^{n},\J\=kn'^J 



= Y.{vle,fK^, (207) 

n=l 

where e'j^I^-^ = ek-i{Xi^ A2, . . . , A^_i, A^+i, . . . , Xn) denotes the (k — l)-order elementary 
symmetric polynomial for all eigenvalues of L except Xn- Note that XnC'^^-^/e^ is exactly 
the marginal probability that n ^ J when J is chosen using Algorithm [Sj in other words, 
the marginal probability of item i is the sum of the contributions (v^Si)'^ made by each 
eigenvector scaled by the respective probabilities that the eigenvectors are selected. The 
contributions are easily computed from the eigendecomposition of L, thus we need only 
and e'^^-^ for each value of n in order to calculate the marginals for all items in 0{N'^) time, 
or 0{ND) time if the rank of L is < TV. 

Algorithm 7 computes e^Zi — ^k-i process of obtaining e^, so naively we could 

run Algorithm^ times, repeatedly reordering the eigenvectors so that each takes a turn 
at the last position. To compute all of the required polynomials in this fashion would 
require 0{N'^k) time. However, we can improve this (for small k) to 0(A^ log(A^)A:^); to 
do so we will make use of a binary tree on N leaves. Each node of the tree corresponds 
to a set of eigenvalues of L; the leaves represent single eigenvalues, and an interior node 
of the tree represents the set of eigenvalues corresponding to its descendant leaves. (See 
Figure [13)) We will associate with each node the set of elementary symmetric polynomials 
ei(A), 62 (A), . . . , e/e(A), where A is the set of eigenvalues represented by the node. 

These polynomials can be computed directly for leaf nodes in constant time, and the 
polynomials of an interior node can be computed given those of its children using a simple 
recursion: 

k 

efc(Ai U A2) = ^ Q(Ai)efc_KA2) . (208) 
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Figure 13: Binary tree with N = 8 leaves; interior nodes represent their descendant leaves. 
Removing a path from leaf n to the root leaves logA^ subtrees that can be combined to 
compute e^^-^. 



Thus, we can compute the polynomials for the entire tree in 0(A^log(A^)A:^) time; this is 
sufficient to obtain at the root node. 

However, if we now remove a leaf node corresponding to eigenvalue n, we invalidate 



the polynomials along the path from the leaf to the root; see Figure [131 This leaves logA^ 
disjoint subtrees which together represent all of the eigenvalues of L, leaving out A^. We can 



now apply Equation (208) logA^ times to the roots of these trees in order to obtain ej^^-^ in 
0{log{N)k'^) time. If we do this for each value of n, the total additional time required is 
0(7Vlog(7V)/c2). 

The algorithm described above thus takes 0{N\og{N)k'^) time to produce the necessary 
elementary symmetric polynomials, which in turn allow us to compute all of the singleton 



marginals. This is a dramatic improvement over applying Equation (199) to each item 
separately. 

5.2.4 Conditioning 

Suppose we want to condition a fc-DPP on the inclusion of a particular set A. For 1^41 + |S| = k 
we have 

Vl{Y = AUB\ACY) oc VliY ^ AVJ B) (209) 

(xVl{Y ^ A[JB) (210) 

oc VLiY ^ AyjB\A(ZY) (211) 

oc det(L^) . (212) 

Thus the conditional fc-DPP is a A: — |74|-DPP whose kernel is the same as that of the 
associated conditional DPP. The normalization constant is Z^_^j^y We can condition on 
excluding A in the same manner. 
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5.2.5 Finding the mode 



Unfortunately, although fc-DPPs offer the efficient versions of DPP inference algorithms 
presented above, finding the most likely set Y remains intractable. It is easy to see that the 
reduction from Section 2.4.5 still applies, since the cardinality of the Y corresponding to an 
exact 3-cover, if it exists, is known. In practice we can utilize greedy approximations, like 
we did for standard DPPs in Section 14.2.11 



5.3 Experiments: image search 



We demonstrate the use of fc-DPPs on an image search task [Kulesza and Taskar, 2Qlla| 



The motivation is as follows. Suppose that we run an image search engine, where our 
primary goal is to deliver the most relevant possible images to our users. Unfortunately, the 
query strings those users provide are often ambiguous. For instance, a user searching for 
"Philadelphia" might be looking for pictures of the city skyline, street-level shots of buildings, 
or perhaps iconic sights like the Liberty Bell or the Love sculpture. Furthermore, even if 
we know the user is looking for a skyline photograph, he or she might specifically want 
a daytime or nighttime shot, a particular angle, and so on. In general, we cannot expect 
users to provide enough information in a textual query to identify the best image with any 
certainty. 

For this reason search engines typically provide a small array of results, and we argue 
that, to maximize the probability of the user being happy with at least one image, the 
results should be relevant to the query but also diverse with respect to one another. That is, 
if we want to maximize the proportion of users searching "Philadelphia" who are satisfied by 
our response, each image we return should satisfy a large but distinct subset of those users, 
thus maximizing our overall coverage. Since we want diverse results but also require control 
over the number of results we provide, a fc-DPP is a natural fit. 



5.3.1 Learning setup 

Of course, we do not actually run a search engine and do not have real users. Thus, in order 
to be able to evaluate our model using real human feedback, we define the task in a manner 
that allows us to obtain inexpensive human supervision via Amazon Mechanical Turk. We 
do this by establishing a simple binary decision problem, where the goal is to choose, given 
two possible sets of image search results, the set that is more diverse. Formally, our labeled 
training data comprises comparative pairs of image sets {{Y^^ ,Y^~)}J^i^ where set Y^^ is 
preferred over set Y^~ , = \Y^~\ = k. We can measure performance on this classification 
task using the zero-one loss, which is zero whenever we choose the correct set from a given 
pair, and one otherwise. 

For this task we employ a simple method for learning a combination of fc-DPPs that is 
convex and seems to work well in practice. Given a set Li, L2, . . . , of "expert" kernel 
matrices, which are fixed in advance, define the combination model 

D 

V^e-Y.^irl, (213) 

1=1 
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Figure 14: The logistic loss function. 



where — ^- Note that this is a combination of distributions, rather than a combination 

of kernels. We will learn 6 to optimize a logistic loss measure on the binary task: 



min m = ^log (l + e-^i'Po(yt^)-T'l(yt-)]^ 

^ t=l 

D 



(214) 



1=1 



where 7 is a hyperparameter that controls how aggressively we penalize mistakes. Intuitively, 
the idea is to find a combination of fc-DPPs where the positive sets Y^^ receive higher 
probability than the corresponding negative sets . By using the logistic loss (Figure 14), 
which acts like a smooth hinge loss, we focus on making fewer mistakes. 



Because Equation (214) is convex in 9 (it is the composition of the convex logistic 
loss function with a linear function of 0), we can optimize it efficiently using projected 
gradient descent, where we alternate taking gradient steps and projecting on the constraint 



— 1- The gradient is given by 



t=l 



where 5* is a vector with entries 



St- 



(215) 



(216) 



Projection onto the simplex is achieved using standard algorithms Bertsekas, 19991 



5.3.2 Data 

We create datasets for three broad image search categories, using 8-12 hand-selected queries 
for each category. (See Table [oj) For each query, we retrieve the top 64 results from Google 
Image Search, restricting the search to JPEG files that pass the strictest level of Safe Search 
filtering. Of those 64 results, we eliminate any that are no longer available for download. 
On average this leaves us with 63.0 images per query, with a range of 59-64. 
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CARS 


CITIES 


DOGS 


Chrysler 


baltimore 


beagle 


ford 


barcelona 


bernese 


honda 


london 


blue heeler 


mercedes 


los angeles 


cocker spaniel 


mitsubishi 


miami 


collie 


nissan 


new york city 


great dane 


porsche 


paris 


labrador 


toyota 


Philadelphia 


pomeranian 




san francisco 


poodle 




shanghai 


pug 




tokyo 


schnauzer 




toronto 


shih tzu 



Table 6: Queries used for data collection. 



We then use the downloaded images to generate 960 training instances for each category, 
spread evenly across the different queries. In order to compare fc-DPPs directly against 
baseline heuristic methods that do not model probabilities of full sets, we generate only 
instances where Y^^ and Yf differ by a single element. That is, the classification problem is 
effectively to choose which of two candidate images if^il is a less redundant addition to a 
given partial result set Yf. 

Y+^YtU{it} Yf^YtU{i^}. (217) 

In our experiments Yf contains five images, so /c = = \Y^~\ — 6. We sample partial 
result sets using a fc-DPP with a SIFT-based kernel (details below) to encourage diversity. 
The candidates are then selected uniformly at random from the remaining images, except for 
10% of instances that are reserved for measuring the performance of our human judges. For 
those instances, one of the candidates is a duplicate image chosen uniformly at random from 
the partial result set, making it the obviously more redundant choice. The other candidate 
is chosen as usual. 

In order to decide which candidate actually results in the more diverse set, we collect 
human diversity judgments using Amazon's Mechanical Turk. Annotators are drawn from 
the general pool of Turk workers, and are able to label as many instances as they wish. 
Annotators are paid $0.01 USD for each instance that they label. For practical reasons, we 
present the images to the annotators at reduced scale; the larger dimension of an image is 
always 250 pixels. The annotators are instructed to choose the candidate that they feel is 
"less similar" to the images in the partial result set. We do not offer any specific guidance on 
how to judge similarity, since dealing with uncertainty in human users is central to the task. 



The candidate images are presented in random order. Figure [15] shows a sample instance 
from each category. 

Overall, we find that workers choose the correct image for 80.8% of the calibration 
instances (that is, they choose the one not belonging to the partial result set). This suggests 
only moderate levels of noise due to misunderstanding, inattention or robot workers. However, 
for non-calibration instances the task is inherently difficult and subjective. To keep noise 
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Figure 15: Sample labeling instances from each search category. The five images on the left 
form the partial result set, and the two candidates are shown on the right. The candidate 
receiving the majority of annotator votes has a blue border. 

in check, we have each instance labeled by five independent judges, and keep only those 
instances where four or more judges agree. In the end this leaves us with 408-482 labeled 
instances per category, or about half of the original instances. 



5.3.3 Kernels 

We define a set of 55 "expert" similarity kernels for the collected images, which form the 
building blocks of our combination model and baseline methods. Each kernel is the 
Gram matrix of some feature function /; that is, Lfj = f{i) • f{j) for images i and j. We 
therefore specify the kernels through the feature functions used to generate them. All of our 
feature functions are normalized so that = 1 for all z; this ensures that no image is 

a priori more likely than any other. Implicitly, thinking in terms of the decomposition in 



Section |3.1[ we are assuming that all of the images in our set are equally relevant in order 
to isolate the modeling of diversity. This assumption is at least partly justified by the fact 
the our images come from actual Google searches, and are thus presumably relevant to the 
query. 

We use the following feature functions, which derive from standard image processing 
and feature extraction methods: 

• Color (2 variants): Each pixel is assigned a coordinate in three-dimensional Lab color 
space. The colors are then sorted into axis-aligned bins, producing a histogram of 
either 8 or 64 dimensions. 

• SIFT (2 variants): The images are processed with the vlf eat toolbox to obtain sets 
of 128-dimensional SIFT descriptors [Lowef |1999[ [Vedaldi and Fulkerson] |2008| . The 
descriptors for a given category are combined, subsampled to a set of 25,000, and 
then clustered using fc-means into either 256 or 512 clusters. The feature vector for 
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an image is the normalized histogram of the nearest clusters to the descriptors in the 
image. 



GIST: The images are processed using code from Oliva and Torralba [2006| to 



yield 960-dimensional GIST feature vectors characterizing properties like "openness," 
"roughness," "naturalness," and so on. 

In addition to the five feature functions described above, we include another five that are 
identical but focus only on the center of the image, defined as the centered rectangle with 
dimensions half those of the original image. This gives our first ten kernels. We then create 
45 pairwise combination kernels by concatenating every possible pair of the 10 basic feature 
vectors. This technique produces kernels that synthesize more than one source of information, 
offering greater fiexibility. 

Finally, we augment our kernels by adding a constant hyperparameter p to each entry. 
p acts a knob for controlling the overall preference for diversity; as p increases, all images 
appear more similar, thus increasing repulsion. In our experiments, p is chosen independently 
for each method and each category to optimize performance on the training set. 

5.3.4 Methods 

We test four different methods. Two use fc-DPPs, and two are derived from Maximum 
Marginal Relevance (MMR) [Carbonell and Goldstein , |l998| . For each approach, we test 



both the single best expert kernel on the training data, as well as a learned combination 
of kernels. All methods were tuned separately for each of the three query categories. On 
each run a random 25% of the labeled examples are reserved for testing, and the remaining 
75% form the training set used for setting hyperparameters and training. Recall that Yt is 
the five-image partial result set for instance t, and let Ct — {if ^ i^} denote the set of two 
candidates images, where if is the candidate preferred by the human judges. 

Best fc-DPP Given a single kernel L, the fc-DPP prediction is 

/cDPPt = argmax Vl{Yt U {i}) . (218) 

i^Ct 

We select the kernel with the best zero-one accuracy on the training set, and apply it to the 
test set. 

Mixture of fc-DPPs We apply our learning method to the full set of 55 kernels, optimizing 



Equation (214) on the training set to obtain a 55-dimensional mixture vector 9. We set 7 
to minimize the zero-one training loss. We then take the learned 9 and apply it to making 
predictions on the test set: 



55 



/cDPPmixt = arg max ^ 9iVl^ {Yt U {i}) . (219) 
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Best MMR Recall that MMR is a standard, heuristic technique for generating diverse 
sets of search results. The idea is to build a set iteratively by adding on each round a 
result that maximizes a weighted combination of relevance (with respect to the query) and 
diversity, measured as the maximum similarity to any of the previously selected results. 
(See Section 4.2.1 for more details about MMR.) For our experiments, we assume relevance 



is uniform; hence we merely need to decide which of the two candidates has the smaller 
maximum similarity to the partial result set. Thus, for a given kernel L, the MMR prediction 



IS 



MMR. 



arg mm 

ieCt 



max L 



(220) 



As for the fc-DPP, we select the single best kernel on the training set, and apply it to the 
test set. 



Mixture MMR We can also attempt to learn a mixture of similarity kernels for MMR. We 
use the same training approach as for fc-DPPs, but replace the probability score PgiYy U {i}) 
with the negative cost 

D 

- ce{Yt, z) = - max V 0^[L/],, , (221) 
^^"^ ^=1 

which is just the negative similarity of item i to the set Yt under the combined kernel metric. 
Significantly, this substitution makes the optimization non-smooth and non-convex, unlike 
the fc-DPP optimization. In practice this means the global optimum is not easily found. 
However even a local optimum may provide advantages over the single best kernel. In our 
experiments we use the local optimum found by projected gradient descent starting from 
the uniform kernel combination. 



5.3.5 Results 

Table [7] shows the mean zero-one accuracy of each method for each query category, averaged 
over 100 random train/test splits. Statistical significance is computed by bootstrapping. 
Regardless of whether we learn a mixture, fc-DPPs outperform MMR on two of the three 
categories, significant at 99% confidence. In all cases, the learned mixture of fc-DPPs achieves 
the best performance. Note that, because the decision being made for each instance is 
binary, 50% is equivalent to random performance. Thus the numbers in Table [t] suggest that 
this is a rather difficult task, a conclusion supported by the rates of noise exhibited by the 
human judges. However, the changes in performance due to learning and the use of fc-DPPs 
are more obviously significant when measured as improvements above this baseline level. 
For example, in the cars category our mixture of fc-DPPs performs 14.58 percentage points 



better than random, versus 9.59 points for MMR with a mixture of kernels. Figure 16 shows 
some actual samples drawn using the fc-DPP sampling algorithm. 

Table [8] shows, for the fc-DPP mixture model, the kernels receiving the highest weights 
for each search category (on average over 100 train/test splits). Combined-feature kernels 
appear to be useful, and the three categories exhibit significant differences in what annotators 
deem diverse, as we might expect. 
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Best 


Best 


Mixture 


Mixture 


Category 


MMR 


/c-DPP 


MMR 


fc-DPP 


CARS 


55.95 


57.98 


59.59 


64.58 


CITIES 


56.48 


56.31 


60.99 


61.29 


DOGS 


56.23 


57.70 


57.39 


59.84 



Table 7: Percentage of real- world image search examples judged the same way as the majority 
of human annotators. Bold results are significantly higher than others in the same row with 
99% confidence. 



porsche 

k=2 




k=4 













"Philadelphia" 

k=2 




k=4 




"cocker spaniel" 




k=4 




Figure 16: Samples from the fc-DPP mixture model. 
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color- 8- center & sift- 256 




CARS 


color- 8- center <k sirt-olz 


(0.11) 




coior-o- center 






sift-512-center 


(0.85) 


CITIES 


gist 


(0.08) 




color-8-center & gist 


(0.03) 




color-8-center 


(0.39) 


DOGS 


color-8-center & sift-512 


(0.21) 




color-8-center & sift-256 


(0.20) 



Table 8: Kernels receiving the highest average weights for each category (shown in parenthe- 
ses). Ampersands indicate kernels generated from pairs of feature functions. 

Single kernel Uniform MMR 

Category (average) mixture mixture 
CARS 57.58 68.31 58.15 
CITIES 59.00 64.76 62.32 

DOGS 57.78 62.12 57.86 

Table 9: The percentage of virtual users whose desired image is more similar to the fc-DPP 
results than the MMR results. Above 50 indicates better /c-DPP performance; below 50 
indicates better MMR performance. The results for the 55 individual expert kernels are 
averaged in the first column. 



We can also return to our original motivation and try to measure how well each method 
"covers" the space of hkely user intentions. Since we do not have access to real users who are 
searching for the queries in our dataset, we instead simulate them by imagining that each is 
looking for a particular target image drawn randomly from the images in our collection. For 
instance, given the query "Philadelphia" we might draw a target image of the Love sculpture, 
and then evaluate each method on whether it selects an image of the Love sculpture, i.e., 
whether it satisfies that virtual user. More generally, we will simply record the maximum 
similarity of any image in the result set to the target image. We expect better methods to 
show higher similarity when averaged over a large number of such users. 

We consider only the mixture models here, since they perform best. For each virtual 
user, we sample a ten-image result set Idpp using the mixture /c-DPP, and select a second 
ten-image result set Immr using the mixture MMR. For MMR, the first image is selected 
uniformly at random, since they are assumed to be uniformly relevant. Subsequent selections 
are deterministic. Given a target image i drawn uniformly at random, we then compute 
similarities 

SDPp{i) = max Lij 5mmr(0 = max Lij (222) 

for a particular similarity kernel L. We report the fraction of the time that 5DPp(i) > 5mmr(0; 
that is, the fraction of the time that our virtual user would be better served by the /c-DPP 
model. Because we have no gold standard kernel L for measuring similarity, we try several 
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possibilities, including all 55 expert kernels, a uniform combination of the expert kernels, 
and the combination learned by MMR. (Note that the mixture fc-DPP does not learn a 
kernel combination, hence there is no corresponding mixture to try here.) Table [o] shows the 
results, averaged across all of the virtual users (i.e., all the images in our collection). Even 
when using the mixture learned to optimize MMR itself, the fc-DPP does a better job of 
covering the space of possible user intentions. All results in Table [9] are significantly higher 
than 50% at 99% confidence. 



6 Structured DPPs 

We have seen in the preceding sections that DPPs offer polynomial-time inference and 
learning with respect to the number of items in the ground set 3^. This is important 
since DPPs model an exponential number of subsets Y C so naive algorithms would 
be intractable. And yet, we can imagine DPP applications for which even linear time is 
too slow. For example, suppose that after modeling the positions of basketball players, 
as proposed in the previous section, we wanted to take our analysis one step further. An 
obvious extension is to realize that a player does not simply occupy a single position, but 
instead moves around the court over time. Thus, we might want to model not just diverse 
sets of positions on the court, but diverse sets of paths around the court during a game. 
While we could reasonably discretize the possible court positions to a manageable number 
M, the number of paths over, say, 100 time steps would be M^^^, making it almost certainly 
impossible to enumerate them all, let alone build an M^^^ x M^^^ kernel matrix. 

However, in this combinatorial setting we can take advantage of the fact that, even 
though there are exponentially many paths, they are structured; that is, every path is built 
from a small number of the same basic components. This kind of structure has frequently 
been exploited in machine learning, for example, to find the best translation of a sentence, 
or to compute the marginals of a Markov random field. In such cases structure allows us to 
factor computations over exponentially many possibilities in an efficient way. And yet, the 
situation for structured DPPs is even worse: when the number of items in y is exponential, 
we are actually modeling a distribution over the doubly exponential number of subsets of 
an exponential y. If there are M^^^ possible paths, there are 2^ subsets of paths, and a 
DPP assigns a probability to every one. This poses an extreme computational challenge. 

In order to develop efficient structured DPPs (SDPPs), we will therefore need to combine 
the dynamic programming techniques used for standard structured prediction with the 
algorithms that make DPP inference efficient. We will show how this can be done by 



applying the dual DPP representation from Section 3.3, which shares spectral properties 
with the kernel L but is manageable in size, and the use of second-order message passing, 
where the usual sum-product or min-sum semiring is replaced with a special structure that 



computes quadratic quantities over a factor graph |Li and Eisner , 2009 . In the end, we will 



demonstrate that it is possible to normalize and sample from an SDPP in polynomial time. 

Structured DPPs open up a large variety of new possibilities for applications; they allow 
us to model diverse sets of essentially any structured object. For instance, we could find 
not only the best translation but a diverse set of high-quality translations for a sentence, 
perhaps to aid a human translator. Or, we could study the distinct proteins coded by a 
gene under alternative RNA splicings, using the diversifying properties of DPPs to cover 
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the large space of possibilities with a small representative set. Later, we will apply SDPPs 
to three real-world tasks: identifying multiple human poses in images, where there are 
combinatorially many possible poses, and we assume that the poses are diverse in that they 
tend not to overlap; identifying salient lines of research in a corpus of computer science 
publications, where the structures are citation chains of important papers, and we want 
to find a small number of chains that covers the major topic in the corpus; and building 
threads from news text, where the goal is to extract from a large corpus of articles the most 
significant news stories, and for each story present a sequence of articles covering the major 
developments of that story through time. 

We begin by defining SDPPs and stating the structural assumptions that are necessary 
to make inference efficient; we then show how these assumptions give rise to polynomial-time 
algorithms using second order message passing. We discuss how sometimes even these 
polynomial algorithms can be too slow in practice, but demonstrate that by applying the 



technique of random projection (Section 3.4) we can dramatically speed up computation and 



reduce memory use while maintaining a close approximation to the original model |Kulesza 



and TaskarllMo). Finally, we show how SDPPs can be applied to the experimental settings 



described above, yielding improved results compared with a variety of standard and heuristic 
baseline approaches. 

6.1 Factorization 



In Section [2^ we saw that DPPs remain tractable on modern computers for up to around 
10,000. This is no small feat, given that the number of subsets of 10,000 items is roughly the 
number of particles in the observable universe to the 40th power. Of course, this is not magic 
but simply a consequence of a certain type of structure; that is, we can perform inference 
with DPPs because the probabilities of these subsets are expressed as combinations of only a 
relatively small set of 0{N'^) parameters. In order to make the jump now to ground sets 3^ 
that are exponentially large, we will need to make an similar assumption about the structure 
of y itself. Thus, a structured DPP (SDPP) is a DPP in which the ground set 3^ is given 
implicitly by combinations of a set of parts. For instance, the parts could be positions on 
the court, and an element of 3^ a sequence of those positions. Or the parts could be rules of 
a context-free grammar, and then an element of y might be a complete parse of a sentence. 
This assumption of structure will give us the algorithmic leverage we need to efficiently work 
with a distribution over a doubly exponential number of possibilities. 

Because elements of y are now structures, we will no longer think of 3^ = {1, 2, ... , N}] 
instead, each element y G 3^ is a structure given by a sequence of R parts (yi, ^2, • • • , VrIi 
each of which takes a value from a finite set of M possibilities. For example, if y is the path 
of a basketball player, then R is the number of time steps at which the player's position is 
recorded, and i/r is the player's discretized position at time r. We will use to denote the 
zth structure in 3^ under an arbitrary ordering; thus 3^ = {2/1,2/25 • • • 5 Vn}^ where N = M^. 
The parts of y^ are denoted i/ir. 

An immediate challenge is that the kernel L, which has N'^ entries, can no longer 
be written down explicitly. We therefore define its entries using the quality /diversity 



decomposition presented in Section 3.1, Recall that this decomposition gives the entries of 
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L as follows: 

Lij = q{yi)<t^{yi)'(t>{yMy3) ' (223) 

where qiyi) is a nonnegative measure of the quality of structure y^, and <j){yi) is a D- 
dimensional vector of diversity features so that (j){y^)^ (j){yj) is a measure of the similarity 
between structures and yj. We cannot afford to specify q and (j) for every possible 
structure, but we can use the assumption that structures are built from parts to define a 
factorization, analogous to the factorization over cliques that gives rise to Markov random 
fields. 

Specifically, we assume the model decomposes over a set of factors F, where a factor a ^ F 
is a small subset of the parts of a structure. (Keeping the factors small will ensure that the 
model is tractable.) We denote by y^ the collection of parts of y that are included in factor 
a; then the factorization assumption is that the quality score decomposes multiplicatively 
over parts, and the diversity features decompose additively: 

q{y) = n lM (224) 
cf>iy)^Y.'>M- (225) 

We argue that these are quite natural factorizations. For instance, in our player tracking 
example we might have a positional factor for each time r, allowing the quality model to 
prefer paths that go through certain high-traffic areas, and a transitional factor for each pair 
of times (r — 1, r), allowing the quality model to enforce the smoothness of a path over time. 
More generally, if the parts correspond to cliques in a graph, then the quality scores can be 
given by a standard log-linear Markov random field (MRF), which defines a multiplicative 
distribution over structures that give labelings of the graph. Thus, while in Section [3^ we 
compared DPPs and MRFs as alternative models for the same binary labeling problems, 
SDPPs can also be seen as an extension to MRFs, allowing us to take a model of individual 
structures and use it as a quality measure for modeling diverse sets of structures. 

Diversity features, on the other hand, decompose additively, so we can think of them 
as global feature functions defined by summing local features, again as done in standard 
structured prediction. For example, (j)r{yr) could track the coarse-level position of a player 
at time r, so that paths passing through similar positions at similar times are less likely 
to co-occur. Note that, in contrast to the unstructured case, we do not generally have 
110(^)11 = 1, since there is no way to enforce such a constraint under the factorization in 



Equation (225). Instead, we simply set the factor features (paiVa) have unit norm for all 
a and all possible values of y^. This slightly biases the model towards structures that have 
the same (or similar) features at every factor, since such structures maximize \\(f)\\- However, 
the effect of this bias seems to be minor in practice. 

As for unstructured DPPs, the quality and diversity models combine to produce balanced, 
high-quality, diverse results. However, in the structured case the contribution of the diversity 
model can be especially significant due to the combinatorial nature of the items in 3^. For 
instance, imagine taking a particular high-quality path and perturbing it slightly, say by 
shifting the position at each time step by a small random amount. This process results in 
a new and distinct path, but is unlikely to have a significant effect on the overall quality: 



75 



the path remains smooth and goes through roughly the same positions. Of course, this 
is not unique to the structured case; we can have similar high-quality items in any DPP. 
What makes the problem especially serious here is that there is a combinatorial number of 
such slightly perturbed paths; the introduction of structure dramatically increases not only 
the number of items in 3^, but also the number of subtle variations that we might want to 
suppress. Furthermore, factored distributions over structures are often very peaked due to 
the geometric combination of quality scores across many factors, so variations of the most 
likely structure can be much more probable than any real alternative. For these reasons 
independent samples from an MRF can often look nearly identical; a sample from an SDPP, 
on the other hand, is much more likely to contain a truly diverse set of structures. 

6.1.1 Synthetic example: particle tracking 

Before describing the technical details needed to make SDPPs computationally efficient, we 
first develop some intuition by studying the results of the model as applied to a synthetic 
motion tracking task, where the goal is to follow a collection of particles as they travel in 
a one-dimensional space over time. This is essentially a simplified version of our player 
tracking example, but with the motion restricted to a line. We will assume that a path y 
has 50 parts, where each part E {1,2,..., 50} is the particle's position at time step r 
discretized into one of 50 locations. The total number of possible trajectories in this setting 
is 50^^, and we will be modeling 2^^ possible sets of trajectories. We define positional and 
transitional factors 

F = {{r} I r = 1, 2, . . . , 50} U {{r - 1, r} I r = 2, 3, . . . , 50} . (226) 

While a real tracking problem would involve quality scores q{y) that depend on some 
observations — for example, measurements over time from a set of physical sensors, or perhaps 
a video feed from a basketball game — for simplicity we determine the quality of a trajectory 
here using only its starting position and a measure of smoothness over time. Specifically, we 
have 

50 

q{y) = qi{yi) H ^(^^-i, yr) , (227) 

r=2 

where the initial quality score qi{yi) is given by a smooth trimodal function with a primary 
mode at position 25 and secondary modes at positions 10 and 40, depicted by the blue 
curves on the left side of Figure [TtJ and the quality scores for all other positional factors 
are fixed to one and have no effect. The transition quality is the same at all time steps, 
and given by q{yr-i, yr) — fj^iUr-i — Vr)^ where //v" is the density function of the normal 
distribution; that is, the quality of a transition is maximized when the particle does not 
change location, and decreases as the particle moves further and further from its previous 
location. In essence, high quality paths start near the central position and move smoothly 
through time. 

We want trajectories to be considered similar if they travel through similar positions, so 
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Sampled particle trajectories (position vs. time) 
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Figure 17: Sets of particle trajectories sampled from an SDPP (top row) and independently 
using only quality scores (bottom row). The curves to the left indicate quality scores for the 
initial positions of the particles. 



we define a 50-dimensional diversity feature vector as follows: 

50 



r=l 



Kl{yr) OC /a/-(/ - Vr), / = 1, 2, . . . , 50 . 



(228) 
(229) 



Intuitively, feature / is activated when the trajectory passes near position /, so trajectories 
passing through nearby positions will activate the same features and thus appear similar in 
the diversity model. Note that for simplicity, the time at which a particle reaches a given 
position has no effect on the diversity features. The diversity features for the transitional 
factors are zero and have no effect. 

We use the quality and diversity models specified above to define our SDPP. In order to 
obtain good results for visualization, we scale the kernel so that the expected number of 
trajectories in a sample from the SDPP is five. We then apply the algorithms developed 



later to draw samples from the model. The first row of Figure 17 shows the results, and 
for comparison each corresponding panel on the second row shows an equal number of 
trajectories sampled independently, with probabilities proportional to their quality scores. 
As evident from the figure, trajectories sampled independently tend to cluster in the middle 
region due to the strong preference for this starting position. The SDPP samples, however, 
are more diverse, tending to cover more of the space while still respecting the quality 
scores — they are still smooth, and still tend to start near the center. 
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6.2 Second-order message passing 

The central computational challenge for SDPPs is the fact that N = is exponentially 



large, making the usual inference algorithms intractable. We showed in Section [3^3] how 
DPP inference can be recast in terms of a smaller dual representation C; however, for this to 
be of any use here we must be able to efficiently compute C for an arbitrary SDPP. We next 
describe the second-order message passing algorithm that makes this possible. Afterwards, 
we will show how the ability to compute C and related quantities leads to polynomial-time 
algorithms for SDPP inference. 

Second-order message passing was first introduced by Li and Eisner |2009| . The main 



idea is to compute second-order statistics over a graphical model by using the standard belief 
propagation message passing algorithm, but with a special semiring in place of the usual 
sum-product or max-product. This substitution will make it possible to compute quantities 
of the form 

E (Upm) ( E ( E '"-(ya)] ' (230) 

where pa are nonnegative and aa and ba are arbitrary functions. Note that we can think of 
Pa as defining a multiplicatively decomposed function 

and aa and ba as defining corresponding additively decomposed functions a and b. Equa- 



tion (230) will apply directly to computing the dual representation C (see Section 3.3) for 
an SDPP. 

We begin by defining the notion of a factor graph, which provides the structure for all 
message passing algorithms. We then describe standard belief propagation on factor graphs, 
and show how it can be defined in a general way using semirings. Finally we demonstrate 



that belief propagation using the semiring proposed by Li and Eisner |2009| computes 



quantities of the form in Equation (230). 



6.2.1 Factor graphs 

Message passing operates on factor graphs. A factor graph is an undirected bipartite graph 
with two types of vertices: variable nodes and factor nodes. Variable nodes correspond to 
the parts of the structure being modeled; for the SDPP setup described above, a factor 
graph contains R variable nodes, each associated to a distinct part r. Similarly, each factor 
node corresponds to a distinct factor a ^ F. Every edge in the graph connects a variable 
node to a factor node, and an edge exists between variable node r and factor node a if and 
only if r G a. Thus, the factor graph encodes the relationships between parts and factors. 



Figure shows an example factor graph for the tracking problem from Section 6.1.1 



It is obvious that the computation of Equation (230) cannot be efficient when factors 
are allowed to be arbitrary, since in the limit a factor could contain all parts and we could 
assign arbitrary values to every configuration y. Thus we will assume that the degree of the 



factor nodes is bounded by a constant c. (In Figure 18, as well as all of the experiments 



we run, we have c = 2.) Furthermore, message passing algorithms are efficient whenever 
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Figure 18: A sample factor graph for the tracking problem. Variable nodes are circular, and 
factor nodes are square. Positional factors that depend only on a single part appear in the 
top row; binary transitional factors appear between parts in the second row. 



the factor graph has low treewidth, or, roughly, when only small sets of nodes need to be 
merged to obtain a tree. Going forward we will assume that the factor graph is a tree, since 
any low-treewidth factor graph can be converted into an equivalent factor tree with bounded 
factors using the junction tree algorithm (Lauritzen and Spiegelhalter , 1988] . 



6.2.2 Belief propagation 



We now describe the basic belief propagation algorithm, first introduced by Pearl |1982| . 
Suppose each factor has an associated real- valued weight function WaiVa)^ giving rise to the 
multiplicatively decomposed global weight function 



(232) 



Then the goal of belief propagation is to efficiently compute sums of w{y) over combinatorially 
large sets of structures y. 

We will refer to a structure y as an assignment to the variable nodes of the factor graph, 
since it defines a value for every part. Likewise we can think of y^ as an assignment 
to the variable nodes adjacent to a, and i/r as an assignment to a single variable node r. 
We use the notation y^ ^ yr to indicate that y^ is consistent with y^, in the sense that it 
assigns the same value to variable node r. Finally, denote by F{r) the set of factors in which 
variable r participates. 

The belief propagation algorithm defines recursive message functions m to be passed 
along edges of the factor graph; the formula for the message depends on whether it is 
traveling from a variable node to a factor node, or vice versa: 



• From a variable r to a factor a: 



rur 



x{yr 



n 



rrin 



(233) 



a'^F{r)-{a} 



• From a factor o: to a variable r: 



r' ^a—{r} 



(234) 
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Intuitively, an outgoing message summarizes all of the messages arriving at the source node, 
excluding the one coming from the target node. Messages from factor nodes additionally 
incorporate information about the local weight function. 

Belief propagation passes these messages in two phases based on an arbitrary orientation 
of the factor tree. In the first phase, called the forward pass, messages are passed upwards 
from the leaves to the root. In the second phase, or backward pass, the messages are passed 
downward, from the root to the leaves. Upon completion of the second phase one message 
has been passed in each direction along every edge in the factor graph, and it is possible to 
prove using an inductive argument that, for every y^, 

n ^a^t{yr)= Yl^c^^y^)- (235) 



If we think of the Wa as potential functions, then Equation (235) gives the (unnormalized) 
marginal probability of the assignment i/r under a Markov random field. 

Note that the algorithm passes two messages per edge in the factor graph, and each 
message requires considering at most assignments, therefore its running time is 0{M^R). 



The sum on the right-hand side of Equation (235), however, is exponential in the number of 



parts. Thus belief propagation offers an efficient means of computing certain combinatorial 
quantities that would naively require exponential time. 

6.2.3 Semirings 

In fact, the belief propagation algorithm can be easily generalized to operate over an arbitrary 
semiring, thus allowing the same basic algorithm to perform a variety of useful computations. 
Recall that a semiring (VF, ©, 0, 0, 1) comprises a set of elements VF, an addition operator ©, 
a multiplication operator ®, an additive identity 0, and a multiplicative identity 1 satisfying 
the following requirements for all a, 6, c G W\ 

• Addition is associative and commutative, with identity 0: 

a®{h®c)^{a®h)®c (236) 
a © 6 = 6 © a (237) 
a © = a (238) 

• Multiplication is associative, with identity 1: 

a®{h®c)^{a®h)®c (239) 
a©l = l©a = a (240) 

• Multiplication distributes over addition: 

a®{h®c)^{a®h)®{a®c) (241) 
{a®h)®c^{a®c)®{h®c) (242) 

• is absorbing under multiplication: 

a©0 = 0©a = (243) 
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Obviously these requirements are met when = R and multiphcation and addition are the 
usual arithmetic operations; this is the standard sum-product semiring. We also have, for 
example, the max-product semiring, where W = [0, oc), addition is given by the maximum 
operator with identity element 0, and multiplication is as before. 

We can rewrite the messages defined by belief propagation in terms of these more general 
operations. For WaiVa) ^ we have 



rrir- 



a'eF{r)-{a} 



(244) 



(245) 



Va^Ur r'ea-{r} 

As before, we can pass messages forward and then backward through the factor tree. 
Because the properties of semirings are sufficient to preserve the inductive argument, we 
then have the following analog of Equation (|235|): 



rrin 



aeF(r) 



(246) 



We have seen that Equation (246) computes marginal probabilities under the sum-product 



semiring, but other semirings give rise to useful results as well. Under the max-product 



semiring, for instance. Equation (246) is the so-called max-marginal — the maximum unnor- 



malized probability of any single assignment y consistent with y^. In the next section we 
take this one step further, and show how a carefully designed semiring will allow us to sum 
second-order quantities across exponentially many structures y. 



6.2.4 Second-order semiring 



Li and Eisner [2009 proposed the following second-order semiring over four-tuples (g, 0, '0, c) E 

(gi, '^1, Ci) (^2, </>2, ^2, C2) = {qi + ^2, </>l + </>2, ^1 + ^2, Ci + C2) (247) 
(gi, 01, '^/^l, Ci) (^2, 02,^2, C2) = (^1^2, qih + q2(t)l, g'1'02 + g'2'01, 

qiC2 + q2Ci + 01^2 + 02'0i) (248) 

= (0,0,0,0) (249) 

1 = (1,0,0,0) (250) 

It is easy to verify that the semiring properties hold for these operations. Now, suppose that 
the weight function for a factor a is given by 

a)^a{y a) 1 Pa 

where pa, da, and ba are as before. Then WaiVa) G W, and we can get some intuition about 
the multiphcation operator by observing that the fourth component of Wa{ya) ® Wa'iVa') is 

PaiVa) [Pa'(2/a')aa'(2/a')V(2/a')] +Pa'(?/a') ba(?/a)«a(?/a)^>a(?/a)] 

+ {PaiVaWiya)] [Pa'{ya')ba'{ya')] + [Pa'(?/a')«a'(2/a')] {Paiy^baiVa)] (252) 
= Paiya)Pa'{ya') [««(?/«) + OaKl/a')] [baiyj + ^a'(2/a')] • (253) 
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In other words, multiplication in the second-order semiring combines the values of p multi- 
plicatively and the values of a and h additively, leaving the result in the fourth component. 
It is not hard to extend this argument inductively and show that the fourth component of 
®cx^F^ci{ya) is given in general by 

( n^-(^c.)) (E^-(^c.)) (E^-(^-)) • (254) 



Thus, by Equation (246) and the definition of ©, belief propagation with the second-order 



semiring yields messages that satisfy 



aeF(r) 



E n P'^iV'^) E «"(^^«) E ^"(?/") • (255) 

y^yr \aeF ) \aeF ) \aeF / 



Note that multiplication and addition remain constant-time operations in the second-order 
semiring, thus belief propagation can still be performed in time linear in the number of 
factors. In the following section we will show that the dual representation C, as well as 



related quantities needed to perform inference in SDPPs, takes the form of Equation (255); 



thus second-order message passing will be an important tool for efficient SDPP inference. 
6.3 Inference 



The factorization proposed in Equation (225) gives a concise definition of a structured DPP 
for an exponentially large 3^; remarkably, under suitable conditions it also gives rise to 
tractable algorithms for normalizing the SDPP, computing marginals, and sampling. The 
only restrictions necessary for efficiency are the ones we inherit from belief propagation: the 
factors must of bounded size so that we can enumerate all of their possible configurations, 
and together they must form a low-treewidth graph on the parts of the structure. These 
are precisely the same conditions needed for efficient graphical model inference [Roller a"nd 



Friedman] |2009 || , which is generalized by inference in SDPPs. 



6.3.1 Computing C 



As we saw in Section 3.3, the dual representation C is sufficient to normalize and marginalize 
an SDPP in time constant in N. However, for this equivalence to be of any use, we must be 
able to compute C for an arbitrary SDPP. As a first attempt, we can write 

C = BB~^ (256) 

= (257) 
yey 

If we think of QaiVa) ^^e factor potentials of a graphical model p{y) (x Yl^^eF ^aiVa)^ 
then computing C is equivalent to computing second moments of the diversity features 
under p (up to normalization) . Since the diversity features factor additively, C is quadratic 
in the local diversity features (^aiVa)- Thus, we could naively calculate C by computing 
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the pairwise marginals piy^^, jJc^/) for all realizations of the factors a, ol and, by linearity of 
expectations, adding up their contributions: 

Oil Voi'jr^c^ {ya)^OL'{ya'V 5 (258) 

where the proportionality is due to the normalizing constant of p{y). However, this sum 
is quadratic in the number of factors and their possible realizations, and can therefore be 
expensive when structures are large. 



Instead, we can substitute the factorization from Equation (225) into Equation (257) to 
obtain ^ 

^ = E ( n ««(^«) ) ( E '^«(^«) ) ( E '^«(^«) ) • (259) 



This sum appears daunting, but we can recognize its form from Equation (255) — it is 
precisely the type of expression computable using second-order message passing in linear 
time. Specifically, we can compute for each pair of diversity features (a, b) the value of 

E (n laiVa)) f E <t>aa{ya)] (E (2^0) 



by summing Equation (255) over the possible assignments yr, and then simply assemble the 
results into the matrix C. Since there are ^'^^^"'"^ unique entries in C and message passing 
runs in time 0{M^R), computing C in this fashion requires 0{D^M^R) time. 

We can make several practical optimizations to this algorithm, though they will not affect 
the asymptotic performance. First, we note that the full set of messages at any variable 



node r is sufficient to compute Equation (260). Thus, during message passing we need only 



perform the forward pass; at that point, the messages at the root node are complete and we 
can obtain the quantity we need. This speeds up the algorithm by a factor of two. Second, 
rather than running message passing times, we can run it only once using a vectorized 
second-order semiring. This has no effect on the total number of operations, but can result 
in significantly faster performance due to vector optimizations in modern processors. The 
vectorized second-order semiring is over four-tuples (g, 0, '0, C) where g G M, 0, '0 G M^, and 
C G R^^^, and uses the following operations: 

(gi, 01, ^1, Ci) e fe, 02, ^2, C2) = {qi + ^2, 01 + 02, ^1 + ^2, Ci + C2) (261) 
(gi,01,'01,Cl) ((^2,02,^2,^2) = (^1^2, g'l02 + g'201, ^^1^2 + g'2'01 , 

qiC2 + q2Ci + 0i^J + 02^7) (262) 

= (0,0,0,0) (263) 

1 = (1,0, 0,0). (264) 

It is easy to verify that computations in this vectorized semiring are identical to those 
obtained by repeated use of the scalar semiring. 
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Given C, we can now normalize and compute marginals for an SDPP using the formulas 
in Section 13.31 for instance 



D 



Xr. 



n=l 



An + 1 



D 



n=l 



(265) 
(266) 



where C = J2n=i ^n'^n'^n eigendecomposition of C. 



Part marginals The introduction of structure offers an alternative type of marginal 
probability, this time not of structures y ^ y but of single part assignments. More precisely, 
we can ask how many of the structures in a sample from the SDPP can be expected to make 
the assignment yr to part r: 



E 5^I(2/Gl"Ay^ 



(267) 
(268) 



y^Vr 



The sum is exponential, but we can compute it efficiently using second-order message passing. 



We apply Equation (266) to get 



D 



y^Vr y^Vr n=i 



y^Vr 

D 



n=l 



y^yr 



n=l " yr^yr VaeF / VaSF / 



(269) 

(270) 
(271) 



The result is a sum of D terms, each of which takes the form of Equation (255), and therefore 



is efficiently computable by message passing. The desired part marginal probability simply 
requires D separate applications of belief propagation, one per eigenvector Vni for a total 
runtime of 0{D^M^R). (It is also possible to vectorize this computation and use a single 
run of belief propagation.) Note that if we require the marginal for only a single part /x^(y^), 
we can run just the forward pass if we root the factor tree at part node r. However, by 
running both passes we obtain everything we need to compute the part marginals for any r 
and yr] the asymptotic time required to compute all part marginals is the same as the time 
required to compute just one. 
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6.3.2 Sampling 

While the dual representation provides useful algorithms for normalization and marginals, 
the dual sampling algorithm is linear in N; for SDPPs, this is too slow to be useful. In order 
to make SDPP sampling practical, we need to be able to efficiently choose a structure 
according to the distribution 

Pr(z/.) = ^E(^^^^)' (272) 
in the first line of the while loop in Algorithm [3) We can use the definition of B to obtain 

= ^ E QHyi){v''Hy^)? (273) 

= ^ E f n ^'iy^a)] f E ^^My^a)] ■ (274) 



Thus, the desired distribution has the familiar form of Equation (255). For instance, the 
marginal probability of part r taking the assignment is given by 

7^ E E (n I'm) (t.^'^'^m) ' (275) 

which we can compute with = |y| runs of belief propagation (or a single vectorized run), 
taking only 0{DM^Rk) time. More generally, the message-passing computation of these 
marginals offers an efficient algorithm for sampling individual full structures y^. We will 
first show a naive method based on iterated computation of conditional marginals, and then 
use it to derive a more efficient algorithm by integrating the sampling of parts into the 
message-passing process. 

Single structure sampling Returning to the factor graph used for belief propagation 



(see Section 6.2.1), we can force a part / to take a certain assignment y^/ by adding a 
new singleton factor containing only r\ and setting its weight function to 1 for y^/ and 
otherwise. (In practice, we do not need to actually create a new factor; we can simply set 
outgoing messages from variable to for all but the desired assignment ?/^/.) It is easy to 



see that Equation (246) becomes 



rria^riVr) = ^aiVa) ^ (276) 

aeF{r) y^yr.Vr' aeF 

where the sum is now doubly constrained, since any assignment y that is not consistent with 
y^f introduces a into the product. If ^cteF^c^iVa) gives rise to a probability measure over 



structures then Equation (276) can be seen as the unnormalized conditional marginal 
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Algorithm 9 Sampling a structure (naive) 



Input: factored q and 0, v 

forr = l,2, ...,i?do 

Run second-order belief propagation with: 

• a = b = (j) 

• assignments in S held fixed 
Sample yr according to Vi{yr\S) oc 

S^SiJ {yr} 
end for 

Output: y constructed from S 



probability of the assignment yr given y^/. For example, using the second-order semiring 
with p — q^ and a — h — (j)^ we have 



aeF(r) 



(277) 



Summing these values for all v G V and normalizing the result yields the conditional 



distribution of yr given fixed assignment y^/ under Equation (274). Going forward we will 



assume for simplicity that V contains a single vector v; however, the general case is easily 
handled by maintaining \V\ messages in parallel, or by vectorizing the computation. 

The observation that we can compute conditional probabilities with certain assignments 
held fixed gives rise to a naive algorithm for sampling a structure according to Pr(yJ in 



Equation (274), shown in Algorithm^ While polynomial. Algorithm ^ requires running 



belief propagation R times, which might be prohibitively expensive for large structures. We 
can do better by weaving the sampling steps into a single run of belief propagation. We 
discuss first how this can be done for linear factor graphs, where the intuition is simpler, 
and then extend it to general factor trees. 



Linear graphs Suppose that the factor graph is a linear chain arranged from left to right. 
Each node in the graph has at most two neighbors — one to the left, and one to the right. 
Assume the belief propagation forward pass proceeds from left to right, and the backward 
pass from right to left. To send a message to the right, a node needs only to receive its 
message from the left. Conversely, to send a message to the left, only the message from the 
right is needed. Thus, the forward and backward passes can be performed independently. 

Consider now the execution of Algorithm [9] on this factor graph. Assume the variable 
nodes are numbered in decreasing order from left to right, so the variable sampled in the first 
iteration is the rightmost variable node. Observe that on iteration r, we do not actually need 
to run belief propagation to completion; we need only the messages incoming to variable 
node r, since those suffice to compute the (conditional) marginals for part r. To obtain 
those messages, we must compute all of the forward messages sent from the left of variable 
r, and the backward messages from the right. Call this set of messages m(r). 
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m(r — 1): 



Vr+l 



i< ^ 



m{r): 




Figure 19: Messages on a linear chain. Only the starred messages need to be computed 
to obtain m(r) from m(r — 1). The double circle indicates that assignment i/r-i has been 
fixed for computing m{r). 



Note that m(l) is just a full, unconstrained forward pass, which can be computed in 
time 0{DM^R). Now compare m(r) to m(r — 1). Between iteration r — 1 and r, the only 
change to S is that variable r — 1, to the right of variable r, has been assigned. Therefore the 
forward messages in m(r), which come from the left, do not need to be recomputed, as they 
are a subset of the forward messages in m(r — 1). Likewise, the backward messages sent 
from the right of variable r — 1 are unchanged, so they do not need to be recomputed. The 
only new messages in m(r) are those backward messages traveling from r — 1 to r. These 
can be computed, using m(r — 1) and the sampled assignment yr-i^ in constant time. See 
Figure [T9| for an illustration of this process. 

Thus, rather than restarting belief propagation on each loop of Algorithm [9| we have 
shown that we need only compute a small number of additional messages. In essence we 
have threaded the sampling of parts r into the backward pass. After completing the forward 
pass, we sample we then compute the backward messages from yi to sample 
and so on. When the backward pass is complete, we sample the final assignment yn and 
are finished. Since the initial forward pass takes 0{DM^R) time and each of the 0{R) 
subsequent iterations takes at most 0{DM^) time, we can sample from Pr(yJ over a linear 
graph in 0{DM^R) time. 



Trees The algorithm described above for linear graphs can be generalized to arbitrary 
factor trees. For standard graphical model sampling using the sum-product semiring, the 
generalization is straightforward — we can simply pass messages up to the root and then 
sample on the backward pass from the root to the leaves. However, for arbitrary semirings 
this is algorithm is incorrect, since an assignment to one node can affect the messages 
arriving at its siblings even when the parent's assignment is fixed. 

Let mi)^a{'\S) be the message function sent from node h to node a during a run of belief 
propagation where the assignments in S have been held fixed. Imagine that we re-root the 
factor tree with a as the root; then define Ta{h) to be the subtree rooted at h (see Figure 20). 
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Several useful observations follow. 

Lemma 6.1. Ifbi and 62 Q^^e distinct neighbors of a, then Ta{bi) and Ta{b2) are disjoint. 

Proof. The claim is immediate, since the underlying graph is a tree. □ 

Lemma 6.2. m})^a['\S) can be computed given only the messages mc^})[-\S) for all neighbors 
a ofb and either the weight function w^, (if b is a factor node) or the assignment to b in 
S (ifb is a variable node and such an assignment exists). 



Proof. Follows from the message definitions in Equations ( 244 ) and ( 245 ) . □ 

Lemma 6.3. m})^a{'\S) depends only on the assignments in S that give values to variables 
inTa{b). 

Proof. If 6 is a leaf (that is, its only neighbor is a), the lemma holds trivially. If b is not 
a leaf, then assume inductively that incoming messages mc^i,{-\S)^ depend only on 

assignments to variables in T\y(c). By Lemma 6.2, the message m5^a('|5') depends only on 
those messages and (possibly) the assignment to b in S. Since b and T}){c) are subgraphs of 
Ta{b)^ the claim follows. □ 

To sample a structure, we begin by initializing Sq = 9 and setting messages rhi^^a — 
mi)^a{'\So) for all neighbor pairs (a, 6). This can be done in 0{DM^R) time via belief 
propagation. 

Now we walk the graph, sampling assignments and updating the current messages m^^^ 
as we go. Step t from node 6 to a proceeds in three parts as follows: 

1. Check whether 6 is a variable node without an assignment in St-i. If so, sample an 
assignment yi^ using the current incoming messages rhc^i,, and set St = St-i U {yt}- 
Otherwise set St = St-i- 
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2. Recompute and update rhij^a using the current messages and Equations (244) and (245), 
taking into account any assignment to h in St- 

3. Advance to node a. 

This simple algorithm has the following useful invariant. 

Theorem 6.4. Following step t from b to a, for every neighbor d of a we have 

rhd^a rnd^a{-\St) . (278) 



Proof. By design, the theorem holds at the outset of the walk. Suppose inductively that 
the claim is true for steps l,2,...,t — 1. Let f be the most recent step prior to t at which 
we visited a, or if step t was our first visit to a. Since the graph is a tree, we know that 
between steps t^ and t the walk remained entirely within Ta{b). Hence the only assignments 
in St — St' are to variables in Ta{b). As a result, for all neighbors d 7^ 6 of a we have 



— '^d^a{'\St) by the inductive hypothesis. Lemma |6.1l and Lemma |6. 3 



It remains to show that rhiy^a — '^h^a{'\Si). For all neighbors c 7^ a of 6, we know that 



rhc^i) — mc^i){-\Si-i) — mc^i){-\St) due to the inductive hypothesis and Lemma 6.3 (since b 



is not in T}j{c)). By Lemma [6^ then, we have rhi,^a — ^6^a('|5't). □ 

Theorem |6.4| guarantees that whenever we sample an assignment for the current variable node 
in the first part of step t, we sample from the conditional marginal distribution Pr(y5|S't_i). 
Therefore, we can sample a complete structure from the distribution Pr(y) if we walk the 
entire tree. This can be done, for example, by starting at the root and proceeding in 
depth- first order. Such a walk takes 0{R) steps, and each step requires computing only 
a single message. Thus, allowing now for A: = |y| > 1, we can sample a structure in time 
0{DM^Rk)^ a significant improvement over Algorithm [oj The procedure is summarized in 
Algorithm [TOl 



Algorithm 10 is the final piece of machinery needed to replicate the DPP sampling 
algorithm using the dual representation. The full SDPP sampling process is given in 
Algorithm 11 and runs in time 0{D^k^ + DM^Rk^)^ where k is the number of eigenvectors 



selected in the first loop. As in standard DPP sampling, the asymptotically most expensive 
operation is the orthonormalization; here we require 0{D^) time to compute each of the 
0{k'^) dot products. 

6.4 Experiments: pose estimation 

To demonstrate that SDPPs effectively model characteristics of real-world data, we apply 



them to a multiple-person pose estimation task [Kulesza and Taskar , 2010] . Our input will 



be a still image depicting multiple people, and our goal is to simultaneously identify the 
poses — the positions of the torsos, heads, and left and right arms — of all the people in the 
image. A pose y is therefore a structure with four parts, in this case literally body parts. To 
form a complete structure, each part r is assigned a position/orientation pair y^. Our quality 
model will be based on "part detectors" trained to estimate the likelihood of a particular 
body part at a particular location and orientation; thus we will focus on identifying poses 
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Algorithm 10 Sampling a structure 



Input: factored q and 0, v 

Initialize rha^h using second-order belief propagation with p — q^^ a — h — v'^c/) 
Let ai, a2, . . . , ttT be a traversal of the factor tree 
for t = 1,2, ...,Tdo 

if at is a variable node r with no assignment in S then 
Sample yr according to Pr(?/^) oc (8)aGF(r) ^a^riVr) 

end if 

if t < T then 

Update rhat^at+i using Equations (244) and (245), fixing assignments in S 
end if 
end for 

Output: y constructed from S 



that correspond well to the image itself. Our similarity model, on the other hand, will focus 
on the location of a pose within the image. Since the part detectors often have uncertainty 
about the precise location of a part, there may be many variations of a single pose that 
outscore the poses of all the other, less detectable people. An independent model would 
thus be likely to choose many similar poses. By encouraging the model to choose a spatially 
diverse set of poses, we hope to improve the chance that the model predicts a single pose for 
each person. 

Our dataset consists of 73 still frames taken from various TV shows, each approximately 
720 by 540 pixels in size |Sapp et al., 2010p As much as possible, the selected frames 
contain three or more people at similar scale, all facing the camera and without serious 



occlusions. Sample images from the dataset are shown in Figure 22, Each person in each 
image is annotated by hand; each of the four parts (head, torso, right arm, and left arm) is 
labeled with the pixel location of a reference point (e.g., the shoulder) and an orientation 
selected from among 24 discretized angles. 



6.4.1 Factorized model 

There are approximately 75,000 possible values for each part, so there are about 4''^'^^^ 
possible poses, and thus we cannot reasonably use a standard DPP for this problem. Instead, 
we build a factorized SDPP. Our factors are given by the standard pictorial structure model 
[Felzenszwalb and Huttenlocher , 2005, Fischler and Elschlager, 1973| , treating each pose 
as a two-level tree with the torso as the root and the head and arms as leaves. Each node 
(body part) has a singleton factor, and each edge has a corresponding pairwise factor. 

Our quality function derives from the model proposed by Sapp et al. [2010 , and is given 



^The images and code were obtained from http://www.vision.grasp.upenn.edu/video 
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Algorithm 11 Sampling from an SDPP 



Input: eigendecomposition {{vn^ ^n)}n=i ^ 

forn = 1,2, . . . ,7V do 

J ^ J U {n} with prob. -y^^ 
end for 

y ^ 

while \V\ > do 

Select Vi from y with Pr(7/J = ^ EvevH^^^V ^i)^ (Algorithm 
Y^YUVi 

V <— V±, where {B^v \ v G Vj^} is an orthonormal basis for the subspace of V 
orthogonal to Cj 

end while 
Output: Y 



10) 



by 

Q{y) = 7 n n ^ryiVr^ Vr') , (279) 

\r=l {ry)^E ) 

where E is the set of edges in the part tree, 7 is a scale parameter that will control the 
expected number of poses in an SDPP sample, and /3 is a sharpness parameter that controls 
the dynamic range of the quality scores. We set the values of the hyperparameters 7 and 
/3 using a held-out training set, as discussed below. The per-part quality scores QriVr) are 
provided by the customized part detectors trained by Sapp et al. |2010| on similar images; 



they assign a value to every proposed location and orientation yj. of part r. The pairwise 
quality scores qryiVr^Vr') ^^'^ defined according to a Gaussian "spring" that encourages, for 
example, the left arm to begin near the left shoulder of the torso. Full details of the model 



are provided in Sapp et al.j [2010 . 



In order to encourage the model not to choose overlapping poses, our diversity features 
reflect the locations of the constituent parts: 



R 



{y) -Y.'^riyr) . (280) 



r=l 



where each (t)r{yr) ^ • There are no diversity features on the edge factors. The local 
features are based on a 8 x 4 grid of reference points xi, X2, . . . , X32 spaced evenly across the 
image; the /th feature is 

MVr) oc fM ( ^^^^^) . (281) 



Here fj^ is again the standard normal density function, and dist(y^,x/) is the Euclidean 
distance between the position of part r (ignoring orientation) and the reference point x/. 
Poses that occupy the same part of the image will be near the same reference points, and 
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thus their feature vectors (j) wih be more closely aligned. The parameter a controls the width 
of the kernel; larger values of a make poses at a given distance appear more similar We set 
(J on a held-out training set. 



6.4.2 Methods 

We compare samples from the SDPP defined above to those from two baseline methods. 
The first, which we call the independent model, draws poses independently according to the 
distribution obtained by normalizing the quality scores, which is essentially the graphical 
model used by Sapp et aT] |201 0|. For this model the number of poses to be sampled must 



be supplied by the user, so to create a level playing field we choose the number of poses 
in an SDPP sample Y. Since this approach does not incorporate a notion of diversity (or 
any correlations between selected poses whatsoever), we expect that we will frequently see 
multiple poses that correspond to the same person. 

The second baseline is a simple non-maximum suppression model [Canny , 1986 , which 



incorporates a heuristic for encouraging diversity. The first pose is drawn from the normalized 
quality model in the same manner as for the independent method. Subsequent poses, however, 
are constrained so that they cannot overlap with the previously selected poses, but otherwise 
drawn according to the quality model. We consider poses overlapping if they cover any 
of the same pixels when rendered. Again, the number of poses must be provided as an 
argument, so we use the number of poses from a sample of the SDPP. While the non-max 
approach can no longer generate multiple poses in the same location, it achieves this using a 
hard, heuristic constraint. Thus, we might expect to perform poorly when multiple people 
actually do overlap in the image, for example if one stands behind the other. 

The SDPP, on the other hand, generates samples that prefer, but do not require poses 
to be spatially diverse. That is, strong visual information in the image can override our 
prior assumptions about the separation of distinct poses. We split our data randomly into a 
training set of 13 images and a test set of 60 images. Using the training set, we select values 
for 7, /3, and a that optimize overall Fi score at radius 100 (see below), as well as distinct 
optimal values of /3 for the baselines. (7 and a are irrelevant for the baselines.) We then use 
each model to sample 10 sets of poses for each test image, for a total of 600 samples per 
model. 



6.4.3 Results 

For each sample from each of the three tested methods, we compute measures of precision 
and recall as well as an Fi score. In our tests, precision is measured as the fraction of 
predicted parts for which both endpoints are within a given radius of the endpoints of an 
expert-labeled part of the same type (head, left arm, and so on). We report results across a 
range of radii. Correspondingly, recall is the fraction of expert-labeled parts with endpoints 
within a given radius of a predicted part of the same type. Since the SDPP model encourages 
diversity, we expect to see improvements in recall at the expense of precision, compared to 
the independent model. Fi score is the harmonic mean of precision and recall. We compute 
all metrics separately for each sample, and then average the results across samples and 
images in the test set. 
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Overall Arms Precision / recall (circles) 




Match radius (in pixels) Match radius (in pixels) Match radius (in pixels) 

(a) (b) (c) 

Figure 21: Results for pose estimation. The horizontal axis gives the acceptance radius 
used to determine whether two parts are successfully matched. 95% confidence intervals are 
shown, (a) Overall Fi scores, (b) Arm Fi scores, (c) Overall precision/recall curves (recall 
is identified by circles). 



The results are shown in Figure [2Ta| At tight tolerances, when the radius of acceptance 



is small, the SDPP performs comparably to the independent and non-max samples, perhaps 
because the quality scores are simply unreliable at this resolution, thus diversity has little 
effect. As the radius increases, however, the SDPP obtains better results, significantly 
outperforming both baselines. Figure 2Tb] shows the curves for just the arm parts, which 



tend to be more difficult to locate accurately and exhibit greater variance in orientation. 



Figure |21c| shows the precision/recall obtained by each model. As expected, the SDPP 
model achieves its improved Fi score by increasing recall at the cost of precision. 

For illustration, we show the SDPP sampling process for some sample images from the 
test set in Figure [22j The SDPP part marginals are visualized as a "cloud" , where brighter 
colors correspond to higher probability. From left to right, we can see how the marginals 
change as poses are selected during the main loop of Algorithm As we saw for simple 
synthetic examples in Figure [Sal the SDPP discounts but does not entirely preclude poses 



that are similar to those already selected. 
6.5 Random projections for SDPPs 

It is quite remarkable that we can perform polynomial-time inference for SDPPs given 
their extreme combinatorial nature. Even so, in some cases the algorithms presented in 



Section |6.3| may not be fast enough. Eigendecomposing the dual representation C, for 
instance, requires 0{D^) time, while normalization, marginalization, and sampling, even 
when an eigendecomposition has been precomputed, scale quadratically in both in terms 
of time and memory. In practice, this limits us to relatively low-dimensional diversity 
features 0; for example, in our pose estimation experiments we built from a fairly coarse 
grid of 32 points mainly for reasons of efficiency. As we move to textual data, this will 
become an even bigger problem, since there is no natural low-dimensional analog for feature 
vectors based on, say, word occurrences. In the following section we will see data where 
natural vectors (j) have dimension D ^ 30,000; without dimensionality reduction, storing 
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Figure 22: Structured marginals for the pose estimation task, visualized as clouds, on 
successive steps of the sampling algorithm. Already selected poses are superimposed. Input 
images are shown on the left. 
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even a single belief propagation message would require over 200 terabytes of memory. 

To address this problem, we will make use of the random projection technique described 



in Section |3.4[ reducing the dimension of the diversity features without sacrificing the 
accuracy of the model. Because Theorem |3.3| depends on a cardinality condition, we will 
focus on fc-SDPPs. As described in Section [5| a fc-DPP is simply a DPP conditioned on the 
cardinality of the modeled subset Y: 



ViY) = ^ ^ , (282) 

E\Y'\=k [UyeYi'iy)) demrycpiY)) 

where 0(1^) denotes the D x |y| matrix formed from columns (l){y) for y gY. When q and 



(j) factor over parts of a structure, as in Section [6T| we will refer to this distribution as a 
fc-SDPP. We note in passing that the algorithms for normalization and sampling in Section |5] 
apply equally well to /c-SDPPs, since they depend mainly on the eigenvalues of L, which we 
can obtain from C. 



Recall that Theorem 3.3 requires projection dimension 

d = 0(max{/c/e, (log(l/5) + logTV)^^}) . (283) 

In the structured setting, N = M^, thus d must be logarithmic in the number of labels and 
linear in the number of parts. Under this condition, we have, with probability at least 1 — 5, 

\\V^ -P^Wi <e^^^-l, (284) 

where V^{Y) is the projected fc-SDPP. 



6.5.1 Toy example: geographical paths 

In order to empirically study the effects of random projections, we test them on a simple 
toy application where D is small enough that the exact model is tractable. The goal is to 
identify diverse, high-quality sets of travel routes between U.S. cities, where diversity is 
with respect to geographical location, and quality is optimized by short paths visiting the 
most populous or most touristy cities. Such sets of paths could be used, for example, by a 
politician to plan campaign routes, or by a traveler organizing a series of vacations. 

We model the problem as a fc-SDPP over path structures having i? = 4 parts, where 
each part is a stop along the path and can take any of M = 200 city values. The quality 
and diversity functions are factored, with a singleton factor for every individual stop and 
pairwise factors for consecutive pairs of stops. The quality of a singleton factor is based 
on the Google hit count for the assigned city, so that paths stopping in popular cities are 
preferred. The quality of a pair of consecutive stops is based on the distance between the 
assigned cities, so that short paths are preferred. In order to disallow paths that travel 
back and forth between the same cities, we augment the stop assignments to include arrival 
direction, and assign a quality score of zero to paths that return in the direction from which 
they came. The diversity features are only defined on the singleton factors; for a given city 
assignment y^, (j)r{yr) is just the vector of inverse distances between and all of the 200 
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Figure 23: Each column shows two samples drawn from a fc-SDPP; from left to right, 
= 2, 3, 4. Circle size corresponds to city quality. 



cities. As a result, paths passing through the same or nearby cities appear similar, and the 
model prefers paths that travel through different regions of the country. We have D — 200. 

Figure [23] shows sets of paths sampled from the fc-SDPP for various values of k. For 
/c = 2, the model tends to choose one path along the east coast and another along the 
west coast. As k increases, a variety of configurations emerge; however, they continue to 
emphasize popular cities and the different paths remain geographically diverse. 

We can now investigate the effects of random projections on this model. Figure |24] shows 
the Li variational distance between the original model and the projected model (estimated 
by sampling), as well as the memory required to sample a set of paths for a variety of 
projection dimensions d. As predicted by Theorem |3.3[ only a relatively small number of 
projection dimensions are needed to obtain a close approximation to the original model. 
Past (i 25, the rate of improvement due to increased dimension falls off dramatically; 
meanwhile, the required memory and running time start to become significant. Figure [24] 
suggests that aggressive use of random projections, like those we employ in the following 
section, is not only theoretically but empirically justified. 



6.6 Experiments: threading graphs 

In this section we put together many of the techniques introduced in this paper in order to 
complete a novel task that we refer to as graph threading [Gillenwater et al. , 2012 . The goal 



is to extract from a large directed graph a set of diverse, salient threads^ or singly-connected 
chains of nodes. Depending on the construction of the graph, such threads can have various 
semantics. For example, given a corpus of academic literature, high-quality threads in the 
citation graph might correspond to chronological chains of important papers, each building 
on the work of the last. Thus, graph threading could be used to identify a set of significant 
lines of research. Or, given a collection of news articles from a certain time period, where 
each article is a node connected to previous, related articles, we might want to display the 
most significant news stories from that period, and for each story provide a thread that 
contains a timeline of its major events. We experiment on data from these two domains in 
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1.2 




Projection dimension 

Figure 24: The effect of random projections. In black, on the left, we estimate the Li 
variational distance between the original and projected models. In blue, on the right, we 
plot the memory required for sampling, which is also proportional to running time. 




Figure 25: An illustration of graph threading applied to a document collection. We first 
build a graph from the collection, using measures of importance and relatedness to weight 
nodes (documents) and build edges (relationships). Then, from this graph, we extract a 
diverse, salient set of threads to represent the collection. 



the following sections. Other possibilities might include discovering trends on social media 
sites, for example, where users can post image or video responses to each other, or mining 



blog entries for important conversations through trackback links. Figure [25] gives an overview 
of the graph threading task for document collections. 

Generally speaking, graph threading offers a means of gleaning insights from collections 
of interrelated objects — for instance, people, documents, images, events, locations, and so 
on — that are too large and noisy for manual examination. In contrast to tools like search, 
which require the user to specify a query based on prior knowledge, a set of threads provides 
an immediate, concise, high-level summary of the collection, not just identifying a set of 
important objects but also conveying the relationships between them. As the availability of 
such datasets continues to grow, this kind of automated analysis will be key in helping us to 
efficiently and effectively navigate and understand the information they contain. 

6.6.1 Related work 



Research from to the Topic Detection and Tracking (TDT) program I Waynej |2000| has 



led to useful methods for tasks like link detection, topic detection, and topic tracking that 
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can be seen as subroutines for graph threading on text collections. Graph threading with 
fc-SDPPs, however, addresses these tasks jointly, using a global probabilistic model with a 
tractable inference algorithm. 



Other work in the topic tracking literature has addressed related tasks Mei and Zhai 



2005, Blei and Lafferty, 2006, Leskovec et al., 20091. In particular, Blei and Lafferty 2006 



proposed dynamic topic models (DTMs), which, given a division of text documents into 
time slices, attempt to fit a generative model where topics evolve over time, and documents 
are drawn from the topics available at the time slice during which they were published. The 
evolving topics found by a DTM can be seen as threads of a sort, but in contrast to graph 
threading they are not composed of actual items in the dataset (in this case, documents). 
In Section |6.6.4 we will return to this distinction when we compare fc-SDPP threading to a 
DTM baseline. 

The information retrieval community has produced other methods for extracting temporal 
information from document collections. Swan and Jensen |2000| proposed a system for 
finding temporally clustered named entities in news text and presenting them on a timeline. 
Allan et al. [2001J introduced the task of temporal summarization^ which takes as input a 



stream of news articles related to a particular topic, and then seeks to extract sentences 
describing important events as they occur. Yan et al. [2011] evaluated methods for choosing 
sentences from temporally clustered documents that are relevant to a query. In contrast, 
graph threading seeks not to extract grouped entities or sentences, but instead to organize a 
subset of the objects (documents) themselves into threads, with topic identification as a side 
effect. 



Some prior work has also focused more directly on threading. Shahaf and Guestrin 



[2010J and Chieu and Lee [2004 proposed methods for selecting individual threads, while 



Shahaf et al. 2012 recently proposed metro maps as alternative structured representations 



of related news stories. Metro maps are effectively sets of non-chronological threads that 
are encouraged to intersect and, in doing so, generate a map of events and topics. However, 
these approaches assume some prior knowledge about content. Shahaf and Guestrin [20101, 



for example, assume the thread endpoints are specified, and Chieu and Lee 2004 require 



a set of query words. Likewise, because they build metro maps individually, Shahaf et al. 



1 2012 implicitly assume that the collection is filtered to a single topic, perhaps from a user 
query. These inputs make it possible to quickly pare down the document graph. In contrast, 
we will apply graph threading to very large graphs, and consider all possible threads. 



6.6.2 Setup 

In order to be as useful as possible, the threads we extract from a data graph need to be 
both high quality, reflecting the most important parts of the collection, and diverse, so that 
they cover distinct aspects of the data. In addition, we would like to be able to directly 
control both the length and the number of threads that we return, since different contexts 
might necessitate different settings. Finally, to be practical our method must be efficient 
in both time and memory use. To achieve these various goals, we will model the graph 
threading problem as a fc-SDPP with random projections. 

Given a directed graph on M vertices with edge set E and a real- valued weight function 
w{-) on nodes and edges, define the weight of a thread y = (^1,^2, • • • , Vr), {Vn Ur+i) ^ E 
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by 

R R 



^(y) = "^^(Vr) + X]'^(yr-l,2/r) • (285) 

r=l r=2 

We can use w to define a simple log-linear quality model for our fc-SDPP: 



q{y) = exp(/3^(2/)) (286) 

/ R R \ ^ 

= Ylexp{w{yr))Ylexp{w{yr-i,yr)) , (287) 



\r=l r=2 



where /3 is a hyperparameter controlling the dynamic range of the quality scores. We fix the 
value of /3 on a validation set in our experiments. 

Likewise, let be a feature function from nodes in the graph to R^; then the diversity 
feature function on threads is 



R 



^{y) = J2^(yr)' (288) 



r=l 



In some cases it might also be convenient to have diversity features on edges of the graph 
as well as nodes. If so, they can be accommodated without much difficulty; however, for 
simplicity we proceed with the setup above. 

We assume that i?, A:, and the projection dimension d are provided; the first two depend 
on application context, and the third, as discussed in Section [6^ is a tradeoff between 
computational efficiency and faithfulness to the original model. To generate diverse thread 
samples, we first project the diversity features by a random d x D matrix G whose entries 
are drawn independently and identically from A/'(0, ^). We then apply second-order message 



passing to compute the dual representation C, as in Section 6.3. 1[ After eigendecomposing 



C, which is only d x d due to the p rojection, we can run the first phase of the fc-DPP 



sampling algorithm from Section 5.2.2 to choose a set V of eigenvectors, and finally complete 



the SDPP samphng algorithm in Section 6.3.2|to obtain a set of k threads Y. We now apply 



this model to two datasets; one is a citation graph of computer science papers, and the other 
is a large corpus of news text. 

6.6.3 Academic citation data 

The Cora dataset comprises a large collection of approximately 200,000 academic papers 
on computer science topics, including citation information [McCallum et al.' , 2000| . We 



construct a directed graph with papers as nodes and citations as edges, and then remove 
papers with missing metadata or zero outgoing citations, leaving us with 28,155 papers. The 
average out-degree is 3.26 citations per paper, and 0.011% of the total possible edges are 
present in the graph. 

To obtain useful threads, we set edge weights to reflect the degree of textual similarity 
between the citing and the cited paper, and node weights to correspond with a measure of 
paper "importance". Specifically, the weight of edge (a, 6) is given by the cosine similarity 
metric, which for two documents a and b is the dot product of their normalized tf-idf vectors. 
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as defined in Section [4.2.11 



cos-sim(a, b) — — , (289) 



Here is a subset of the words found in the documents. We select W by filtering according 
to document frequency; that is, we remove words that are too common, appearing in more 
than 10% of papers, or too rare, appearing in only one paper. After filtering, there are 
50,912 unique words. 



The node weights are given by the LexRank score of each paper |Erkan and Radev 



2004| . The LexRank score is the stationary distribution of the thresholded, binarized. 



row- normalized matrix of cosine similarities, plus a damping term, which we fix to 0.15. 
LexRank is a measure of centrality, so papers that are closely related to many other papers 
will receive a higher score. 

Finally, we design the diversity feature function to encourage topical diversity. Here 
we apply cosine similarity again, representing a document by the 1,000 documents to which 
it is most similar. This results in binary (j) of dimension D — M — 28, 155 with exactly 
1,000 non-zeros; (t)i{yr) — 1 implies that / is one of the 1,000 most similar documents to 
y^. Correspondingly, the dot product between the diversity features of two documents is 
proportional to the fraction of top- 1,000 documents they have in common. In order to make 
fc-SDPP inference efficient, we project (j) down to d = 50 dimensions. 



Figure [26| illustrates the behavior of the model when we set A: = 4 and i? = 5. Samples 
from the model, like the one presented in the figure, offer not only some immediate intuition 
about the types of papers contained in the collection, but, upon examining individual threads, 
provide a succinct illustration of the content and development of each area. Furthermore, 



the sampled threads cover distinct topics, standing apart visually in Figure [26] and exhibiting 
diverse salient terms. 

6.6.4 News articles 

Our news dataset comprises over 200,000 articles from the New York Times, collected from 



2005-2007 as part of the Enghsh Gigaword corpus [Graff and Cierif |2009| . We split the 



articles into six groups, with six months' worth of articles in each group. Because the corpus 
contains a significant amount of noise in the form of articles that are short snippets, lists 
of numbers, and so on, we filter the results by discarding articles more than two standard 
deviations longer than the mean article, articles less than 400 words, and articles whose 
fraction of non-alphabetic words is more than two standard deviations above the mean. On 
average, for each six-month period we are left with 34,504 articles. 

For each time period, we generate a graph with articles as nodes. As for the citations 
dataset, we use cosine similarity to define edge weights. The subset of words W used to 
compute cosine similarity contains all words that appear in at least 20 articles and at most 
15% of the articles. Across the six time periods, this results in an average of 36,356 unique 
words. We include in our graph only those edges with cosine similarity of at least 0.1; 
furthermore, we require that edges go forward in time to enforce the chronological ordering 
of threads. The resulting graphs have an average of 0.32% of the total possible edges, and an 
average degree of 107. As before, we use LexRank for node weights, and the top-1000 similar 
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Thread: learning lifelong training tasks invariances control 

1. Locally Weighted Learning for Control 

2. Discovering Structure in Multiple Learning Tasks: The TC Algorithm 

3. Learning One More Thing 

4. Explanation Based Learning for Mobile Robot Perception 

5. Learning Analytically and Inductively 



Thread: mobile clients hoard server client database 



1. A Database Architecture for Handling Mobile Clients 

2. An Architecture for Mobile Databases 

3. Database Server Organization for Handling Mobile Clients 

4. Mobile Wireless Computing: Solutions and Challenges in Data Management 

5. Energy Efficient Query Optimization 



Figure 26: Sampled threads from a 4-SDPP with thread length i? = 5 on the Cora dataset. 
Above, we plot a subset of the Cora papers, projecting their tf-idf vectors to two dimensions 
by running PCA on the centroids of the threads, and then highlight the thread selections in 
color. Displayed beside each thread are the words in the thread with highest tf-idf score. 
Below, we show the titles of the papers in two of the threads. 
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Figure 27: Visualization of a single article node and all of its neighboring article nodes. 



documents to define a binary feature function <j). We add a constant feature p to (/), which 
controls the overall degree of repulsion; large values of p make all documents more similar 
to one another. We set p and the quality model hyperparameters to maximize a cosine 



similarity evaluation metric (see Section 6.6.4), using the data from the first half of 2005 as 



a development set. Finally, we randomly project the diversity features from D ^ 34, 500 
to (i = 50 dimensions. For all of the following experiments, we use /c = 10 and R — S. All 
evaluation metrics we report are averaged over 100 random samples from the model. 

Graph visualizations In order to convey the scale and content of the graphs built 



from news data, we provide some high-resolution renderings. Figure [27| shows the graph 
neighborhood of a single article node from our development set. Each node represents an 
article and is annotated with the corresponding headline; the size of each node reflects its 
weight, as does the thickness of an edge. The horizontal position of a node corresponds to 
the time at which the article was published, from left to right; the vertical positions are 



optimized for readability. In the digital version of this paper. Figure 27 can be zoomed in 



order to read the headlines; in hardcopy, however, it is likely to be illegible. As an alternative. 



an online, zoomable version of the figure is available at |http : //zoom . it/GUCR 



Visualizing the entire graph is quite challenging since it contains tens of thousands of 
nodes and millions of edges; placing such a figure in the paper would be impractical since 
the computational demands of rendering it and the zooming depth required to explore it 
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would exceed the abilities of modern document viewers. Instead, we provide an online, 
zoomable version based upon a high-resolution (540 megapixel) rendering, available at 
http://zoom.it/jOKV. Even at this level of detail, only 1% of the edges are displayed; 
otherwise they become visually indistinct. As in Figure [27| each node represents an article 
and is sized according to its weight and overlaid with its headline. The horizontal position 
corresponds to time, ranging from January 2005 (on the left) to June 2005 (on the right). 
The vertical positions are determined by similarity to a set of threads sampled from the 
fc-SDPP, which are rendered in color. 



Baselines We will compare the fc-SDPP model to two natural baselines. 

k-means baseline. A simple method for this task is to split each six-month period of 
articles into R equal-size time slices, and then apply fc-means clustering to each slice, using 
cosine similarity at the clustering metric. We can then select the most central article from 
each cluster to form the basis of a set of threads. The k articles chosen from time slice r are 
matched one-to-one with those from slice r — 1 by computing the pairing that maximizes the 
average cosine similarity of the pairs — that is, the coherence of the threads. Repeating this 
process for all r yields a set of k threads of length i?, where no two threads will contain the 
same article. However, because clustering is performed independently for each time slice, it 
is likely that the threads will sometimes exhibit discontinuities when the articles chosen at 
successive time slices do not naturally align. 



DTM baseline. A natural extension, then, is the dynamic topic model (DTM) of Blei 



and Lafferty 2006| , which explicitly attempts to find topics that are smooth through time. 
We use publicly available cod^ to fit DTMs with the number of topics set to k and with 
the data split into R equal time slices. We set the hyperparameters to maximize the cosine 
similarity metric (see Section [6. 6. 4| ) on our development set. We then choose, for each topic 
at each time step, the document with the highest per-word probability of being generated 
by that topic. Documents from the same topic form a single thread. 



Figure 28 shows some of the threads sampled randomly from the fc-SDPP for our develop- 
ment set, and Figure 29 shows the same for threads produced by the DTM baseline. An 
obvious distinction is that topic model threads always span nearly the entire time period, 
selecting one article per time slice as required by the form of the model, while the DPP can 
select threads covering only the relevant span. Furthermore, the headlines in the figures 
suggest that the fc-SDPP produces more tightly focused, narrative threads due to its use of 
the data graph, while the DTM threads, though topically related, tend not to describe a 
single continuous news story. This distinction, which results from the fact that topic models 
are not designed with threading in mind, and so do not take advantage of the explicit relation 
information given by the graph, means that fc-SDPP threads often form a significantly more 
coherent representation of the news collection. 



Comparison to human summaries We provide a quantitative evaluation of the threads 
generated by our baselines and sampled from the fc-SDPP by comparing them to a set 
of human-generated news summaries. The human summaries are not threaded; they are 

^http : / / code . google . com/p/princeton-statistical-learning/ 
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Thread: pope Vatican church parkinson 

Parkinson's Disease Increases Risks to Pope 

Pope's Health Raises Questions About His Abihty to Lead 

Pope Returns Home After 18 Days at Hospital 

Pope's Condition Worsens as World Prepares for End of Papacy 

Pope, Though Gravely 111, Utters Thanks for Prayers 

Europeans Fast Falling Away from Church 

In Developing World, Choice [of Pope] Met with Skepticism 

Pope Sends Message with Choice of Name 



Figure 28: A set of five news threads randomly sampled from a fc-SDPP for the first half of 
2005. Above, the threads are shown on a timeline with the most salient words superimposed; 
below, the dates and headlines from a single thread are listed. 
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hotel kitchen casa inches post shade monica closet ' 
0^ mets rangers dodgers delgado martinez astacio angels mientkiewicz 
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cancer heart breast women disease aspirin risk study 
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Thread: cancer heart breast women disease aspirin risk study 

Jan 11: Study Backs Meat, Colon Tumor Link 

Feb 07: Patients — and Many Doctors — Still Don't Know How Often Women 
Get Heart Disease 

Mar 07: Aspirin Therapy Benefits Women, but Not in the Way It Aids Men 
Mar 16: Study Shows Radiation Therapy Doesn't Increase Heart Disease Risk 
for Breast Cancer Patients 



Apr 11 
May 16 
May 24 

Jun 21 



Personal Health: Women Struggle for Parity of the Heart 
Black Women More Likely to Die from Breast Cancer 
Studies Bolster Diet, Exercise for Breast Cancer Patients 
Another Reason Fish is Good for You 



Figure 29: A set of five news threads generated by the dynamic topic model for the first 
half of 2005. Above, the threads are shown on a timeline with the most salient words 
superimposed; below, the dates and headlines from a single thread are listed. 
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ROUGE-1 ROUGE-2 ROUGE-SU4 

System Cos-sim F Prec/Rec F Prec/Rec F Prec/Rec 

fc-means 29^9 16^5 17.3/15.8 0.695 0.73/0.67 3?76 3.94/3.60 

DTM 27.0 14.7 15.5/14.0 0.750 0.81/0.70 3.44 3.63/3.28 

fc-SDPP 33.2 17.2 17.7/16.7 0.892 0.92/0.87 3.98 4.11/3.87 

Table 10: Similarity of automatically generated timelines to human summaries. Bold 
entries are significantly higher than others in the column at 99% confidence, verified using 
bootstrapping. 

System Rating Interlopers 
fc-means 2.73 0.71 
DTM 3.19 1.10 
fc-SDPP 3.31 1.15 

Table 11: Rating: average coherence score from 1 (worst) to 5 (best). Interlopers: average 
number of interloper articles identified (out of 2). Bold entries are significantly higher with 
95% confidence. 



flat, approximately daily news summaries found in the Agence France-Presse portion of the 
Gigaword corpus, distinguished by their "multi" type tag. The summaries generally cover 
world news, which is only a subset of the contents of our dataset. Nonetheless, they allow 
us to provide an extrinsic evaluation for this novel task without generating gold standard 
timelines manually, which is a difficult task given the size of the corpus. We compute four 
metrics: 

• Cosine similarity. We concatenate the human summaries over each six-month period 
to obtain a target tf-idf vector, concatenate the set of threads to be evaluated to obtain 
a predicted tf-idf vector, and then compute the cosine similarity (in percent) between 
the target and predicted vectors. All hyperparameters are chosen to optimize this 
metric on a validation set. 



• ROUGE-1, 2, and SU4. As described in Section |4XT| ROUGE is an automatic 
evaluation metric for text summarization based on n-gram overlap statistics [Lin , 2004| . 
We report three standard variants. 



Table 10 shows the results of these comparisons, averaged over all six half-year inter- 
vals. Under each metric, the fc-SDPP produces threads that more closely resemble human 
summaries. 

Mechanical Turk evaluation An important distinction between the baselines and the 
fc-SDPP is that the former are topic-oriented, choosing articles that relate to broad subject 
areas, while the fc-SDPP approach is story-oriented^ chaining together articles with direct 



individual relationships. An example of this distinction can be seen in Figures [28| and [29 

To obtain a large-scale evaluation of this type of thread coherence, we employ Mechanical 
Turk, on online marketplace for inexpensively and efficiently completing tasks requiring 
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Edit a news timeline 

* You will see a series of news headlines arranged chronologically, 

* Your goa] is lo help ihe timeline tell a single clear slory. 

* Please seleci exactly two articles that you think should be removed to improve the flow of the timeline. 

* If you aren't surSj use your best judgment. 

* To get more informatioUj you can hover your mouse over a headline to see the beginning of the article text. 

1 □ Jan 06, ZO05: GM TO ABSORB AND REPACKAGE MONEY-LOSING SATURN 

a □ Jan 06, 2005: DEMOCRATS TRY TO ALTER SOCML SECURFTY DERATE 

a □ Jan 08, 2005: 204a AND ALL THAT: UNTANGLING THE DEBATE ON SOCIAL SECURTTY 

4 □ Feb 04, 2005: PRIVATELY OPERATED TOLL ROADS, COMMON IN EUROPE, MAY BE THE FUTURE IN THE U.S. 

5 uApr oz, 2005: FEW SEE GAINS FROM SOCIAL SECURITY TOUR 

6 Apr 19, 2005: CLOSING DOWN THE SENATE WONT HELP DEMOCRATS 

7 Apr 21, 2005: SENATE MOVES CLOSER TO NUCLEAR OPTION WTTH COMMriTEE APPROVAL OF BUSH 
JUDICIAL NOMINEES 

8 CiMay 17, 2005: SENATE MODERATES SEEK FILIBUSTER COMPROMISE AFTER LEADERS FAILED 

9 oMay 24, 2DOs: AS BAITLE APPROACHED, BOTH SIDES DUG IN 

10 □ Jun 07, 2005: SENATE SET FOR BROWN CONFIRMATION VOTE 

Please rate the quality of the ^al timeline on a scale of 1-5: 

* L means that the articles are totally unrelated. 

* 5 means that the articles teU a smgle clear stoiy, 

00000 
12345 



Figure 30: A screenshot of the Mechanical Turk task presented to annotators. 



human judgment. We asked Turkers to read the headhnes and first few sentences of each 
article in a timeline and then rate the overall narrative coherence of the timeline on a scale 
of 1 ("the articles are totally unrelated") to 5 ("the articles tell a single clear story"). Five 
separate Turkers rated each timeline. The average ratings are shown in the left column of 



Table [TT] the fc-SDPP timelines are rated as significantly more coherent, while fc-means does 
poorly since it has no way to ensure that clusters are similar between time slices. 

In addition, we asked Turkers to evaluate threads implicitly by performing a second 
task. (This also had the side benefit of ensuring that Turkers were engaged in the rating 
task and did not enter random decisions.) We displayed timelines into which two additional 
"interloper" articles selected at random had been inserted, and asked users to remove the 
two articles that they thought should be removed to improve the flow of the timeline. A 



screenshot of the task is provided in Figure 30 . Intuitively, the true interlopers should be 



selected more often when the original timeline is coherent. The average number of interloper 



articles correctly identified is shown in the right column of Table 11 



Runtime Finally, assuming that tf-idf and feature values have been computed in advance 



(this process requires approximately 160 seconds), we report in Table 12 the time required 
to produce a set of threads on the development set. This measurement includes clustering 
for the fc-means baseline, model fitting for the DTM baseline, and random projections, 
computation of the covariance matrix, and sampling for the fc-SDFF. The tests were run 
on a machine with eight Intel Xeon E5450 cores and 32G of memory. Thanks to the use of 
random projections, the fc-SDFF is not only the most faithful to human news summaries, 
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System Runtime (s) 



fc-means 

DTM 

fc-SDPP 



625.63 
19,433.80 
252.38 



Table 12: Running time for the tested methods. 



but also the fastest by a large margin. 

7 Conclusion 

We have presented a variety of intuitions, algorithms, extensions, and theoretical results 
for determinantal point processes, showing how these models can be made practical for 
real-world modeling and learning. We gave a decomposition that emphasizes a fundamental 
tradeoff between sets of high-quality items and sets that are diverse, and discussed how the 
two models interact. We introduced a dual representation for DPPs that dramatically speeds 
inference for large data sets, and proved that random projections can be used to reduce 
the computational demands of inference even further without sacrificing model fidelity. We 
showed that the quahty model can be learned efficiently for conditional DPPs, and apphed the 
resulting techniques to perform multi-document summarization. We proposed conditioning a 
DPP on its cardinality, leading to fc-DPPs, which offer increased expressiveness and control. 
We then showed how these advantages can be used to improve the fraction of users satisfied 
by image search results. We considered DPPs defined over exponentially-large structured 
sets, and proposed a novel factorization that makes inference in this setting tractable 
through the use of second-order message passing. We then validated the performance of the 
structured model on a multiple pose identification task, improving over heuristic techniques 
for diversification. Finally, we applied these methods to perform complex threading of two 
large document corpora, extracting diverse research threads from academic citation data 
and automatically building news timelines from the Gigaword corpus. 
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